pkgs <- c(
"here", # for file path management
"knitr", # for knitting rmarkdown
"kableExtra", # for table styling
"gmodels", # for crosstables
"hypr", # for calculating custom factor contrast codings
"lme4", # for mixed-effects models
"lmerTest", # significance tests for GLMMs
"optimx", # lme4 optimizer
"MuMIn", # for r.squaredGLMM
"car", # for recoding
"caret", # for evaluating model
"performance", # for evaluating model
"parameters", # for summarizing model
"ggeffects", # get partial effects from models
"Hmisc", # for `somers2()` function
"party", # for random forests
"permimp", # for faster varimp for {party}
"phangorn", # for neighborNets
"analogue", # for fusing distance matrices
"vegan", # for Mantel tests of distance matrix correlations
"remotes", # for installing from GitHub
"tidyverse", # for data wrangling and plotting: dplyr, ggplot2, etc.
"ggdendro", # for working with dendrograms in ggplot2
"patchwork", # for combining ggplots
"scales", # for adding scales to ggplots
"ggrepel", # for nicely annotating points with text
"extrafont" # for fonts in figures
)
Comparative variation analysis
Complete code supplement
Introduction
This document contains the complete R code for the analyses presented in Szmrecsanyi & Grafmiller (2023) Comparative variation analysis: Grammatical alternations in World Englishes, Cambridge University Press. All code for the plots and analyses in the book can be found here. Some code is folded for readability, but you can see the code by clicking the “Code” buttons above the tables or figures, or clicking “Show All Code” in the </> code menu in the upper right. The full Quarto source code for this document can be found under “View Source”.
This document follows the chapter structure in the book, reconstructing all the quantitative analyses in Chapters 4-7. For readability, supplementary code for some parts has been moved to an Appendix (Chapter 6). All datasets and other files are available in the accompanying Open Science Framework repository: https://osf.io/5hvtw/.
R setup
Libraries
Install the packages (if not already installed) and load them.
install.packages(pkgs[!pkgs %in% installed.packages()[, 1]])
invisible(lapply(pkgs, FUN = function(x) suppressMessages(library(x, character.only = TRUE))))
rm(pkgs) # remove list
Install and load the {VADIS} package.
# remotes::install_github("jasongraf1/VADIS")
library(VADIS)
Custom functions
The file custom_functions.R contains functions used for data wrangling and other things.
source("custom_functions.R")
Plot styling
Set style for figures.
extrafont::loadfonts(device = "win")
text_cols <- c("black", "white")
theme_cup <- function() {
# Set the theme
theme_classic() %+replace%
theme(
strip.text.x = element_text(
size = 10, color = "black",
hjust = 0, face = "bold", family = "serif"
),
strip.background = element_blank(),
axis.text.x = element_text(color = "black", size = 12),
axis.text.y = element_text(color = "black", size = 12),
axis.title = element_text(size = rel(1.3)),
plot.title = element_text(
size = rel(1.4), color = "black", hjust = 0,
face = "bold",
margin = margin(b = 10)
),
axis.line = element_line(color = "black"),
panel.grid.minor = element_blank(),
text = element_text(family = "serif")
)
}
Datasets
Load in the annotated datasets from the corpora and preprocess them. Descriptions of these datasets are found in Chapter 4 of the book, and further details are provided in the annotation manuals contained in the OSF repository: https://osf.io/5hvtw/.
For each we collapse infrequent levels of lexical items and texts, which will be included as random effects in the regression models.
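The `filter_infrequent()` helper is defined in `custom_functions.R` and is not reproduced in this document. As a rough sketch of what such a helper plausibly does — the implementation below is an assumed reconstruction, not the authors' actual code — it collapses factor levels with fewer than a threshold number of tokens into a single residual level:

```r
# Hypothetical sketch of filter_infrequent(): collapse levels of `x` that
# occur fewer than `threshold` times into one "OTHER" level, so that sparse
# random-effect levels do not destabilize the mixed models.
filter_infrequent <- function(x, threshold) {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < threshold]
  x[x %in% rare] <- "OTHER"
  factor(x)
}
```
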
Set the order of the variety labels.
var_levs <- c("BrE", "CanE", "IrE", "NZE", "HKE", "IndE", "JamE", "PhlE", "SgE")
Genitives
data_genitives <- readRDS(here("data", "data_gentives_CUP.rds")) |>
as.data.frame()
data_genitives$PorHeadLemma_pruned <- filter_infrequent(data_genitives$POR_HEAD_LEMMA, 20)
data_genitives$PumHeadLemma_pruned <- filter_infrequent(data_genitives$PUM_HEAD_LEMMA, 20)
data_genitives$FileID_pruned <- filter_infrequent(data_genitives$FILE_ID, 20)
data_genitives$variety_of_E <- factor(data_genitives$variety_of_E, levels = var_levs)
Datives
data_datives <- here("data", "data_datives_CUP.rds") |>
readRDS()
data_datives$Verb_pruned <- filter_infrequent(data_datives$Verb, 20)
data_datives$RecHeadPlain_pruned <- filter_infrequent(data_datives$RecHeadPlain, 20)
data_datives$ThemeHeadPlain_pruned <- filter_infrequent(data_datives$ThemeHeadPlain, 20)
data_datives$FileID_pruned <- filter_infrequent(data_datives$FileID, 20) # less than 20 and the glmer model won't converge
data_datives$variety_of_E <- factor(data_datives$variety_of_E, levels = var_levs)
Particle verbs
data_particle_verbs <- here("data", "data_particle_verbs_CUP.rds") |>
readRDS()
data_particle_verbs$VerbPart_pruned <- filter_infrequent(data_particle_verbs$VerbPart, 20)
data_particle_verbs$Verb_pruned <- filter_infrequent(data_particle_verbs$Verb, 20)
data_particle_verbs$FileID_pruned <- filter_infrequent(data_particle_verbs$FileID, 20)
data_particle_verbs$variety_of_E <- factor(data_particle_verbs$variety_of_E, levels = var_levs)
Chapter 4
This chapter introduces the corpora used, and describes the data collection and annotation procedures.
Table 4.2 shows the size of the GloWbE corpus.
Code
data_glowbe_corpus <- read.csv(here("data", "glowbe_corpus_size.csv")) |>
rename(Country = "X")
data_glowbe_corpus |>
kable() |>
kable_styling()
| Country | Code | Websites | Webpages | Words |
|---|---|---|---|---|
| UnitedStates | US | 82260 | 275156 | 386809355 |
| Canada | CA | 33776 | 135692 | 134765381 |
| GreatBritain | GB | 64351 | 381841 | 387615074 |
| Ireland | IE | 15840 | 102147 | 101029231 |
| Australia | AU | 28881 | 129244 | 148208169 |
| NewZealand | NZ | 14053 | 82679 | 81390476 |
| India | IN | 18618 | 113765 | 96430888 |
| SriLanka | LK | 4208 | 38389 | 46583115 |
| Pakistan | PK | 4955 | 42769 | 51367152 |
| Bangladesh | BD | 5712 | 45059 | 39658255 |
| Singapore | SG | 8339 | 45459 | 42974705 |
| Malaysia | MY | 8966 | 45601 | 42420168 |
| Philippines | PH | 10224 | 46342 | 43250093 |
| HongKong | HK | 8740 | 43936 | 40450291 |
| SouthAfrica | ZA | 10308 | 45264 | 45364498 |
| Nigeria | NG | 4516 | 37285 | 42646098 |
| Ghana | GH | 3616 | 47351 | 38768231 |
| Kenya | KE | 5193 | 45962 | 41069085 |
| Tanzania | TZ | 4575 | 41356 | 35169042 |
| Jamaica | JM | 3488 | 46748 | 39663666 |
| Total | | 340619 | 1792045 | 1885632973 |
Tables 4.3-4.5 show the distribution of alternation variants by corpus and variety.
Code
gens_df_split <- split(data_genitives, data_genitives$Corpus)
ice_summary <- table(
gens_df_split$ICE$variety_of_E,
gens_df_split$ICE$Response
) |>
make_prop_table()
glowbe_summary <- table(
gens_df_split$GloWbE$variety_of_E,
gens_df_split$GloWbE$Response
) |>
make_prop_table()
cbind(ice_summary, glowbe_summary) |>
kable() |>
add_header_above(c(" ", "ICE" = 5, "GloWbE" = 5), bold = T) |>
kable_styling()
| | of | % | s | % | Total | of | % | s | % | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| BrE | 787 | 74.0 | 276 | 26.0 | 1063 | 388 | 70.7 | 161 | 29.3 | 549 |
| CanE | 631 | 63.2 | 368 | 36.8 | 999 | 207 | 57.3 | 154 | 42.7 | 361 |
| IrE | 629 | 70.8 | 259 | 29.2 | 888 | 294 | 69.2 | 131 | 30.8 | 425 |
| NZE | 770 | 70.6 | 321 | 29.4 | 1091 | 525 | 67.7 | 251 | 32.3 | 776 |
| HKE | 719 | 69.8 | 311 | 30.2 | 1030 | 320 | 64.1 | 179 | 35.9 | 499 |
| IndE | 904 | 79.5 | 233 | 20.5 | 1137 | 477 | 68.2 | 222 | 31.8 | 699 |
| JamE | 744 | 81.0 | 174 | 19.0 | 918 | 265 | 68.1 | 124 | 31.9 | 389 |
| PhlE | 898 | 74.8 | 302 | 25.2 | 1200 | 360 | 65.6 | 189 | 34.4 | 549 |
| SgE | 609 | 67.1 | 299 | 32.9 | 908 | 195 | 61.5 | 122 | 38.5 | 317 |
| Total | 6691 | NA | 2543 | NA | 9234 | 3031 | NA | 1533 | NA | 4564 |
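`make_prop_table()` also lives in `custom_functions.R`. A minimal sketch of the idea — an assumed reconstruction, so column order, rounding, and the `NA` placement may differ from the real function — turns a two-column contingency table into counts with row percentages and totals:

```r
# Hypothetical sketch of make_prop_table(): given a 2-column table of counts,
# add row-wise percentages, a row-total column, and a column-total row.
make_prop_table <- function(tab) {
  tab <- as.matrix(tab)
  pct <- round(100 * prop.table(tab, margin = 1), 1)
  out <- cbind(tab[, 1], pct[, 1], tab[, 2], pct[, 2], rowSums(tab))
  out <- rbind(out, Total = c(sum(tab[, 1]), NA, sum(tab[, 2]), NA, sum(tab)))
  colnames(out) <- c(colnames(tab)[1], "%", colnames(tab)[2], "%", "Total")
  out
}
```
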
Code
dats_df_split <- split(data_datives, data_datives$Corpus)
ice_summary <- table(
dats_df_split$ice$variety_of_E,
dats_df_split$ice$Response
) |>
make_prop_table()
glowbe_summary <- table(
dats_df_split$glowbe$variety_of_E,
dats_df_split$glowbe$Response
) |>
make_prop_table()
cbind(ice_summary, glowbe_summary) |>
kable() |>
add_header_above(c(" ", "ICE" = 5, "GloWbE" = 5), bold = T) |>
kable_styling()
| | do | % | pd | % | Total | do | % | pd | % | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| BrE | 640 | 73.2 | 234 | 26.8 | 874 | 292 | 65.8 | 152 | 34.2 | 444 |
| CanE | 671 | 72.9 | 250 | 27.1 | 921 | 297 | 68.1 | 139 | 31.9 | 436 |
| IrE | 645 | 74.4 | 222 | 25.6 | 867 | 279 | 69.2 | 124 | 30.8 | 403 |
| NZE | 736 | 71.3 | 296 | 28.7 | 1032 | 320 | 70.6 | 133 | 29.4 | 453 |
| HKE | 841 | 66.0 | 433 | 34.0 | 1274 | 288 | 58.3 | 206 | 41.7 | 494 |
| IndE | 608 | 56.3 | 471 | 43.7 | 1079 | 304 | 64.5 | 167 | 35.5 | 471 |
| JamE | 683 | 72.4 | 260 | 27.6 | 943 | 294 | 68.1 | 138 | 31.9 | 432 |
| PhlE | 661 | 65.5 | 348 | 34.5 | 1009 | 334 | 67.3 | 162 | 32.7 | 496 |
| SgE | 772 | 72.9 | 287 | 27.1 | 1059 | 322 | 66.5 | 162 | 33.5 | 484 |
| Total | 6257 | NA | 2801 | NA | 9058 | 2730 | NA | 1383 | NA | 4113 |
Code
pv_df_split <- split(data_particle_verbs, data_particle_verbs$Corpus)
ice_summary <- table(
pv_df_split$ice$variety_of_E,
pv_df_split$ice$Response
) |>
make_prop_table()
glowbe_summary <- table(
pv_df_split$glowbe$variety_of_E,
pv_df_split$glowbe$Response
) |>
make_prop_table()
cbind(ice_summary, glowbe_summary) |>
kable() |>
add_header_above(c(" ", "ICE" = 5, "GloWbE" = 5), bold = T) |>
kable_styling()
Chapter 5
Chapter 5 focuses on the analysis of individual alternations. For each alternation we fit a single random forest model to the entire dataset and assess variable importance; we then examine variable importance for each variety individually, based on separate by-variety forests. Finally, we fit a mixed-effects regression model to the entire dataset and examine significant interactions between internal constraints and variety.
5.2 Genitive alternation
Cross tabulate genitive variant and variety of English.
table(data_genitives$variety_of_E, data_genitives$Response) |>
make_prop_table() |>
kable() |>
kable_styling()
| | of | % | s | % | Total |
|---|---|---|---|---|---|
| BrE | 1175 | 72.9 | 437 | 27.1 | 1612 |
| CanE | 838 | 61.6 | 522 | 38.4 | 1360 |
| IrE | 923 | 70.3 | 390 | 29.7 | 1313 |
| NZE | 1295 | 69.4 | 572 | 30.6 | 1867 |
| HKE | 1039 | 68.0 | 490 | 32.0 | 1529 |
| IndE | 1381 | 75.2 | 455 | 24.8 | 1836 |
| JamE | 1009 | 77.2 | 298 | 22.8 | 1307 |
| PhlE | 1258 | 71.9 | 491 | 28.1 | 1749 |
| SgE | 804 | 65.6 | 421 | 34.4 | 1225 |
| Total | 9722 | NA | 4076 | NA | 13798 |
Variable importance in global random forest model
Now fit conditional random forest to the full genitives dataset.
# formula
gen_f1 <- Response ~
PorAnimacyBin +
PorLength +
PumLength +
PorNPexprTypeBin +
PorFinalSibilancy +
PreviousChoice +
SemanticRelationBin +
PorHeadFreq
gen_f2 <- update(gen_f1, . ~ . + variety_of_E + Circle + GenreCoarse)
gen_f3 <- update(gen_f1, . ~ . + GenreCoarse)
Run the forest and save the output. {party} models are large and slow, so save the output for reloading. Note that model outputs are not currently stored on OSF.
if (file.exists(here("model_output", "gen_forest1.rds"))) {
# load file if already run
gen_forest1 <- readRDS(here("model_output", "gen_forest1.rds"))
} else {
# if file doesn't already exist, run the analysis and save to file
gen_forest1 <- party::cforest(gen_f2, data_genitives)
saveRDS(gen_forest1, here("model_output", "gen_forest1.rds"))
}
Get predictions, again saving output for quicker loading later.
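The `[c(FALSE, TRUE)]` indexing in the next chunk relies on logical recycling: `party::treeresponse()` returns one probability pair per observation, and after `unlist()` the recycled logical index keeps every second element, i.e. the probability of the second response level. A minimal base-R illustration:

```r
# Two observations, each with probabilities for the ("of", "s") pair:
probs <- c(0.7, 0.3, 0.4, 0.6)
# The logical index c(FALSE, TRUE) recycles along the vector,
# keeping elements 2, 4, ... (the second class's probability).
probs[c(FALSE, TRUE)]
# [1] 0.3 0.6
```
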
if (file.exists(here("model_output", "gen_forest1_preds.rds"))) {
gen_forest1_preds <- readRDS(here("model_output", "gen_forest1_preds.rds"))
} else {
gen_forest1_preds <- unlist(party::treeresponse(gen_forest1))[c(FALSE, TRUE)]
saveRDS(gen_forest1_preds, here("model_output", "gen_forest1_preds.rds"))
}
Check predictive performance.
data_genitives$fitted_crf1 <- gen_forest1_preds
data_genitives$predicted_crf1 <- as.factor(ifelse(data_genitives$fitted_crf1 >= .5, "s", "of"))
caret::confusionMatrix(data_genitives$predicted_crf1, data_genitives$Response,
mode = "prec_recall"
)
Confusion Matrix and Statistics
Reference
Prediction of s
of 9126 1150
s 596 2926
Accuracy : 0.8735
95% CI : (0.8678, 0.879)
No Information Rate : 0.7046
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6835
Mcnemar's Test P-Value : < 2.2e-16
Precision : 0.8881
Recall : 0.9387
F1 : 0.9127
Prevalence : 0.7046
Detection Rate : 0.6614
Detection Prevalence : 0.7447
Balanced Accuracy : 0.8283
'Positive' Class : of
Check the area under the ROC curve. This is alternatively known as the concordance index c (Harrell 2001: 105, 257).
data.frame(
obs = data_genitives$Response,
pred = data_genitives$predicted_crf1,
of = 1 - data_genitives$fitted_crf1,
s = data_genitives$fitted_crf1
) |>
caret::twoClassSummary(lev = levels(data_genitives$Response))
      ROC      Sens      Spec
0.9433525 0.9386957 0.7178606
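The equivalence between ROC AUC and the concordance index c can be checked by hand. A base-R sketch — the helper name `concordance_c` is ours, not from the book's code; `Hmisc::somers2()` computes the same quantity:

```r
# The concordance index c is the probability that a randomly chosen positive
# case receives a higher fitted probability than a randomly chosen negative
# case, counting ties as 1/2 -- identical to the ROC AUC.
concordance_c <- function(fitted, is_positive) {
  pos <- fitted[is_positive]
  neg <- fitted[!is_positive]
  diffs <- outer(pos, neg, FUN = "-") # all positive-vs-negative pairs
  (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
}
# e.g. concordance_c(data_genitives$fitted_crf1,
#                    data_genitives$Response == "s")
# should reproduce the ROC value reported above.
```
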
Get (unconditional) variable importance scores, saving outputs again.
if (!file.exists(here("model_output", "gen_forest1_varimp.rds"))) {
gen_forest1_varimp <- permimp(gen_forest1, conditional = FALSE, progressBar = FALSE, asParty = TRUE)
saveRDS(gen_forest1_varimp, here("model_output", "gen_forest1_varimp.rds"))
} else {
gen_forest1_varimp <- here("model_output", "gen_forest1_varimp.rds") |>
readRDS()
}
invisible(gc(verbose = FALSE)) # free up unused memory (party RFs are big)
Create Figure 5.1 plotting the variable importances for the entire genitives dataset.
data.frame(Varimp = gen_forest1_varimp$values) |>
# rename(Varimp = "gen_forest1_varimp$values") |>
rownames_to_column("Variable") |>
ggplot(aes(reorder(Variable, Varimp), Varimp)) +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = Variable, yend = 0), col = "black") +
geom_point(col = "black") +
coord_flip() +
theme_cup() +
labs(x = "", y = "Variable Importance")
Variable importance in by-variety random forest models
Next we fit RFs to each variety individually and calculate the variable importance. First split the data.
data_genitives_split <- split(data_genitives, data_genitives$variety_of_E)
Now fit a random forest for each variety.
if (file.exists(here("model_output", "gen_crf_list.rds"))) {
genitives_variety_forests <- readRDS(
here("model_output", "gen_crf_list.rds")
)
} else {
genitives_variety_forests <- lapply(
data_genitives_split, function(x) fit_party_crf(gen_f3, x)
)
saveRDS(genitives_variety_forests, here("model_output", "gen_crf_list.rds"))
}
Now get the (unconditional) variable importance scores.
if (file.exists(here("model_output", "gen_crf_varimps.rds"))) {
genitives_variety_varimps <- readRDS(
here("model_output", "gen_crf_varimps.rds")
)
} else {
genitives_variety_varimps <- lapply(
genitives_variety_forests, function(x) get_party_varimp(x)$values
)
saveRDS(genitives_variety_varimps, here("model_output", "gen_crf_varimps.rds"))
}
Create Figure 5.2 showing by-variety variable importance.
Code
preds <- genitives_variety_varimps[[1]] |>
sort() |>
names()
lapply(
seq_along(genitives_variety_varimps),
function(i) {
x <- genitives_variety_varimps[[i]]
x |>
as.data.frame() |>
rename(Varimp = 1) |>
rownames_to_column("Variable") |>
mutate(Variable = factor(Variable, levels = preds)) |>
# bind_rows() |>
# mutate(
# Variety = rep(names(genitives_variety_varimps), each = 9)
# ) |>
ggplot(aes(Variable, Varimp)) +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = Variable, yend = 0), col = "black") +
geom_point(col = "black") +
coord_flip() +
# facet_wrap(~Variety, ncol = 3) +
theme_cup() +
labs(x = "", y = "", title = names(genitives_variety_varimps)[i])
}
) |>
wrap_plots(ncol = 3) +
plot_annotation(caption = "Click to enlarge")
Variety interactions in mixed-effects regression model
Next fit a generalized mixed-effects regression model to the genitives dataset.
Set reference levels.
data_genitives$Response <- relevel(data_genitives$Response, ref = "of")
data_genitives$PorAnimacyBin <- relevel(data_genitives$PorAnimacyBin, ref = "inanimate")
data_genitives$PorNPexprTypeBin <- relevel(data_genitives$PorNPexprTypeBin, ref = "other")
data_genitives$PorFinalSibilancy <- relevel(data_genitives$PorFinalSibilancy, ref = "final_sibilant_absent")
Formula for genitives.
f_reg_gens <- Response ~
PorAnimacyBin +
PorLength +
PumLength +
PorNPexprTypeBin +
PorFinalSibilancy +
variety_of_E +
PorAnimacyBin:variety_of_E +
PorFinalSibilancy:variety_of_E +
(1 | FileID_pruned) +
(1 | PumHeadLemma_pruned)
Set controls for glmer().
glmer_ctrl <- glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
Fit the mixed-effects logistic regression model:
if (!file.exists(here("model_output", "gens_glmer1.rds"))) {
glmer_genitives1 <- glmer(f_reg_gens,
data = data_genitives,
family = binomial, control = glmer_ctrl
)
saveRDS(glmer_genitives1, here("model_output", "gens_glmer1.rds"))
} else {
glmer_genitives1 <- readRDS(here("model_output", "gens_glmer1.rds"))
}
Check the model fit.
performance::model_performance(glmer_genitives1)
# Indices of model performance
AIC | AICc | BIC | R2 (cond.) | R2 (marg.) | ICC | RMSE | Sigma | Log_loss | Score_log | Score_spherical
----------------------------------------------------------------------------------------------------------------------------
10403.316 | 10403.469 | 10644.349 | 0.738 | 0.629 | 0.293 | 0.340 | 1.000 | 0.363 | -Inf | 0.001
Precision and recall.
data_genitives$fitted_glmer1 <- fitted(glmer_genitives1)
data_genitives$predicted_glmer1 <- as.factor(ifelse(data_genitives$fitted_glmer1 >= .5, "s", "of"))
caret::confusionMatrix(data_genitives$predicted_glmer1, data_genitives$Response,
mode = "prec_recall"
)
Confusion Matrix and Statistics
Reference
Prediction of s
of 8903 1427
s 819 2649
Accuracy : 0.8372
95% CI : (0.831, 0.8433)
No Information Rate : 0.7046
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5913
Mcnemar's Test P-Value : < 2.2e-16
Precision : 0.8619
Recall : 0.9158
F1 : 0.8880
Prevalence : 0.7046
Detection Rate : 0.6452
Detection Prevalence : 0.7487
Balanced Accuracy : 0.7828
'Positive' Class : of
Area under ROC curve.
data.frame(
obs = data_genitives$Response,
pred = data_genitives$predicted_glmer1,
of = 1 - data_genitives$fitted_glmer1,
s = data_genitives$fitted_glmer1
) |>
caret::twoClassSummary(lev = levels(data_genitives$Response))
      ROC      Sens      Spec
0.8956718 0.9157581 0.6499019
Check for multicollinearity. Lower scores are better.
kappa_mer(glmer_genitives1)
[1] 22.89032
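`kappa_mer()` is another helper from `custom_functions.R`; it presumably computes the condition number κ of the fixed-effects model matrix (an assumption on our part), where values around 30 or above are conventionally read as a sign of harmful collinearity. A base-R illustration of the underlying computation, using the built-in `mtcars` data as a stand-in design matrix:

```r
# The condition number kappa is the ratio of the largest to smallest
# singular value of the (scaled) design matrix; orthogonal predictors
# give kappa = 1, collinear predictors push it upward.
X <- model.matrix(~ disp + wt, data = mtcars) # example fixed-effects matrix
kappa(scale(X[, -1]), exact = TRUE)           # drop intercept, centre/scale
```
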
performance::check_collinearity(glmer_genitives1)
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
PorLength 1.07 [ 1.05, 1.09] 1.04 0.93 [0.91, 0.95]
PumLength 1.06 [ 1.04, 1.08] 1.03 0.95 [0.93, 0.96]
PorNPexprTypeBin 1.07 [ 1.05, 1.09] 1.03 0.93 [0.92, 0.95]
Moderate Correlation

              Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
     PorAnimacyBin 8.45 [8.19, 8.72]         2.91      0.12     [0.11, 0.12]
 PorFinalSibilancy 7.69 [7.45, 7.93]         2.77      0.13     [0.13, 0.13]

High Correlation

                           Term    VIF       VIF 95% CI Increased SE Tolerance Tolerance 95% CI
                   variety_of_E  44.44 [ 43.01,  45.93]         6.67      0.02     [0.02, 0.02]
     PorAnimacyBin:variety_of_E 223.87 [216.56, 231.44]        14.96  4.47e-03     [0.00, 0.00]
 PorFinalSibilancy:variety_of_E  29.37 [ 28.43,  30.35]         5.42      0.03     [0.03, 0.04]
Check for overdispersion.
performance::check_overdispersion(glmer_genitives1)
# Overdispersion test
dispersion ratio = 0.950
Pearson's Chi-Squared = 13080.509
p-value = 1
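The dispersion ratio reported here is the Pearson chi-squared statistic divided by the residual degrees of freedom; a ratio near 1 (here 0.950) indicates no overdispersion. A quick hand check — the residual df value below is an assumption back-derived from the reported numbers, not taken from the model object:

```r
# Dispersion ratio = Pearson chi-squared / residual degrees of freedom.
dispersion_ratio <- function(pearson_chisq, df_residual) {
  pearson_chisq / df_residual
}
dispersion_ratio(13080.509, 13769) # roughly 0.95, matching the output above
```
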
Create Table 5.2 of regression model effects in the genitive alternation.
parameters::model_parameters(glmer_genitives1, effects = "fixed", verbose = FALSE) |>
filter(p < 0.05) |>
as.data.frame() |>
select(c(1:3, 7, 9)) |>
knitr::kable(digits = 3) |>
kableExtra::kable_styling()
| Parameter | Coefficient | SE | z | p |
|---|---|---|---|---|
| (Intercept) | -0.994 | 0.161 | -6.173 | 0.000 |
| PorAnimacyBinanimate | 2.057 | 0.160 | 12.879 | 0.000 |
| PorLength | -0.667 | 0.027 | -24.911 | 0.000 |
| PumLength | 0.582 | 0.023 | 25.718 | 0.000 |
| PorNPexprTypeBincommon_noun | -0.954 | 0.057 | -16.599 | 0.000 |
| PorFinalSibilancyfinal_sibilant_present | -0.951 | 0.198 | -4.803 | 0.000 |
| variety_of_ECanE | 0.452 | 0.137 | 3.313 | 0.001 |
| variety_of_EHKE | 0.744 | 0.128 | 5.828 | 0.000 |
| variety_of_ENZE | 0.506 | 0.129 | 3.926 | 0.000 |
| variety_of_EPhlE | 0.327 | 0.131 | 2.483 | 0.013 |
| variety_of_ESgE | 0.567 | 0.139 | 4.064 | 0.000 |
| PorAnimacyBinanimate:variety_of_EHKE | -0.714 | 0.233 | -3.058 | 0.002 |
| PorAnimacyBinanimate:variety_of_EPhlE | -1.056 | 0.210 | -5.037 | 0.000 |
| PorFinalSibilancyfinal_sibilant_present:variety_of_ECanE | -0.739 | 0.312 | -2.370 | 0.018 |
| PorFinalSibilancyfinal_sibilant_present:variety_of_EHKE | -1.354 | 0.339 | -3.989 | 0.000 |
| PorFinalSibilancyfinal_sibilant_present:variety_of_ESgE | -0.750 | 0.309 | -2.424 | 0.015 |
Now we plot the partial effects for the PorAnimacyBin:variety_of_E (Figure 5.3) and PorFinalSibilancy:variety_of_E (Figure 5.4) interactions:
Figure 5.3
Code
gen_var_anim_partial_eff <- effects::Effect(c("variety_of_E", "PorAnimacyBin"), glmer_genitives1) |>
as.data.frame() |>
mutate(
variety_of_E = factor(variety_of_E, levels = var_levs),
PorAnimacyBin = relevel(PorAnimacyBin, ref = "animate")
)
gen_var_anim_partial_eff |>
ggplot(aes(variety_of_E, fit, shape = PorAnimacyBin)) +
geom_vline(aes(xintercept = variety_of_E), color = "gray", linetype = "dashed") +
geom_errorbar(aes(ymin = lower, ymax = upper),
width = .2, position = position_dodge(width = .4)
) +
annotate("rect",
xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
alpha = 0.2, fill = c("gray")
) +
geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
labs(y = "Predicted probability of s-genitive", x = "Variety", title = "") +
scale_shape_manual(name = "PorAnimacyBin", values = c(21, 16)) +
theme_cup() +
theme( # plot.title = element_text(size=20, face="bold", vjust=2),
plot.title = element_blank(),
axis.title.x = element_text(size = 14, vjust = -0.5),
axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
axis.title.y = element_text(size = 14, vjust = 1.5),
legend.position = "top"
) +
scale_y_continuous(labels = scales::percent)
Figure 5.4
Code
gen_var_sib_partial_eff <- effects::Effect(c("variety_of_E", "PorFinalSibilancy"), glmer_genitives1) |>
as.data.frame() |>
mutate(
variety_of_E = factor(variety_of_E, levels = var_levs)
)
gen_var_sib_partial_eff |>
ggplot(aes(variety_of_E, fit, shape = PorFinalSibilancy)) +
geom_vline(aes(xintercept = variety_of_E), color = "gray", linetype = "dashed") +
geom_errorbar(aes(ymin = lower, ymax = upper),
width = .2, position = position_dodge(width = .4)
) +
annotate("rect",
xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
alpha = 0.2, fill = c("gray")
) +
geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
labs(y = "Predicted probability of s-genitive", x = "Variety", title = "") +
scale_shape_manual(name = "PorFinalSibilancy", values = c(16, 21)) +
theme_cup() +
theme( # plot.title = element_text(size=20, face="bold", vjust=2),
plot.title = element_blank(),
axis.title.x = element_text(size = 14, vjust = -0.5),
axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
axis.title.y = element_text(size = 14, vjust = 1.5),
legend.position = "top"
) +
scale_y_continuous(labels = scales::percent)
5.3 Dative alternation
Cross tabulate response and variety of English.
table(data_datives$variety_of_E, data_datives$Response) |>
make_prop_table() |>
kable() |>
kable_styling()
| | do | % | pd | % | Total |
|---|---|---|---|---|---|
| BrE | 932 | 70.7 | 386 | 29.3 | 1318 |
| CanE | 968 | 71.3 | 389 | 28.7 | 1357 |
| IrE | 924 | 72.8 | 346 | 27.2 | 1270 |
| NZE | 1056 | 71.1 | 429 | 28.9 | 1485 |
| HKE | 1129 | 63.9 | 639 | 36.1 | 1768 |
| IndE | 912 | 58.8 | 638 | 41.2 | 1550 |
| JamE | 977 | 71.1 | 398 | 28.9 | 1375 |
| PhlE | 995 | 66.1 | 510 | 33.9 | 1505 |
| SgE | 1094 | 70.9 | 449 | 29.1 | 1543 |
| Total | 8987 | NA | 4184 | NA | 13171 |
Variable importance in global random forest model
Now fit a random forest to the full datives dataset. Again, save output for reloading.
# formula
dat_f1 <- Response ~
logWeightRatio +
RecPron +
ThemeBinComplexity +
ThemeHeadFreq +
ThemePron +
ThemeDefiniteness +
RecGivenness +
RecHeadFreq
dat_f2 <- update(dat_f1, . ~ . + variety_of_E + Circle + GenreCoarse)
dat_f3 <- update(dat_f1, . ~ . + GenreCoarse)
Run the forest.
if (!file.exists(here("model_output", "dat_forest1.rds"))) {
dat_forest1 <- party::cforest(dat_f2, data_datives)
saveRDS(dat_forest1, here("model_output", "dat_forest1.rds"))
} else {
dat_forest1 <- here("model_output", "dat_forest1.rds") |>
readRDS()
}
invisible(gc(verbose = FALSE)) # garbage collect to save RAM
Get predictions.
if (!file.exists(here("model_output", "dat_forest1_preds.rds"))) {
dat_forest1_preds <- unlist(party::treeresponse(dat_forest1))[c(FALSE, TRUE)]
saveRDS(dat_forest1_preds, here("model_output", "dat_forest1_preds.rds"))
} else {
dat_forest1_preds <- here("model_output", "dat_forest1_preds.rds") |>
readRDS()
}
invisible(gc(verbose = FALSE))
Check predictive performance.
data_datives$fitted_crf1 <- dat_forest1_preds
data_datives$predicted_crf1 <- as.factor(ifelse(data_datives$fitted_crf1 >= .5, "pd", "do"))
caret::confusionMatrix(data_datives$predicted_crf1, data_datives$Response,
mode = "prec_recall"
)
Confusion Matrix and Statistics
Reference
Prediction do pd
do 6132 2951
pd 2855 1233
Accuracy : 0.5592
95% CI : (0.5507, 0.5677)
No Information Rate : 0.6823
P-Value [Acc > NIR] : 1.0000
Kappa : -0.0231
Mcnemar's Test P-Value : 0.2125
Precision : 0.6751
Recall : 0.6823
F1 : 0.6787
Prevalence : 0.6823
Detection Rate : 0.4656
Detection Prevalence : 0.6896
Balanced Accuracy : 0.4885
'Positive' Class : do
Area under ROC curve.
data.frame(
obs = data_datives$Response,
pred = data_datives$predicted_crf1,
do = 1 - data_datives$fitted_crf1,
pd = data_datives$fitted_crf1
) |>
caret::twoClassSummary(lev = levels(data_datives$Response))
      ROC      Sens      Spec
0.5020562 0.6823189 0.2946941
Calculate (unconditional) variable importance.
if (!file.exists(here("model_output", "dat_forest1_varimp.rds"))) {
dat_forest1_varimp <- permimp(dat_forest1, conditional = FALSE, progressBar = FALSE, asParty = TRUE)
saveRDS(dat_forest1_varimp, here("model_output", "dat_forest1_varimp.rds"))
} else {
dat_forest1_varimp <- here("model_output", "dat_forest1_varimp.rds") |>
readRDS()
}
Create Figure 5.5
dat_forest1_varimp$values |>
as.data.frame() |>
rename(Varimp = "dat_forest1_varimp$values") |>
rownames_to_column("Variable") |>
ggplot(aes(reorder(Variable, Varimp), Varimp)) +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = Variable, yend = 0), col = "black") +
geom_point(col = "black") +
coord_flip() +
theme_cup() +
labs(x = "", y = "Variable Importance")
Variable importance in by-variety random forest models
Next we fit CRFs to each variety individually, and calculate variable importance per variety.
data_datives_split <- split(data_datives, data_datives$variety_of_E)
Fit a random forest for each variety.
if (file.exists(here("model_output", "datives_variety_forests.rds"))) {
datives_variety_forests <- readRDS(
here("model_output", "datives_variety_forests.rds")
)
} else {
datives_variety_forests <- lapply(
data_datives_split, function(x) fit_party_crf(dat_f3, x)
)
saveRDS(datives_variety_forests, here("model_output", "datives_variety_forests.rds"))
}
Get the variable importances.
if (file.exists(here("model_output", "datives_variety_varimps.rds"))) {
datives_variety_varimps <- readRDS(
here("model_output", "datives_variety_varimps.rds")
)
} else {
datives_variety_varimps <- lapply(
datives_variety_forests, function(x) get_party_varimp(x)$values
)
saveRDS(datives_variety_varimps, here("model_output", "datives_variety_varimps.rds"))
}
Create Figure 5.6
Code
preds <- datives_variety_varimps[[1]] |>
sort() |>
names()
lapply(
seq_along(datives_variety_varimps),
function(i) {
x <- datives_variety_varimps[[i]]
x |>
as.data.frame() |>
rename(Varimp = 1) |>
rownames_to_column("Variable") |>
mutate(Variable = factor(Variable, levels = preds)) |>
# bind_rows() |>
# mutate(
# Variety = rep(names(genitives_variety_varimps), each = 9)
# ) |>
ggplot(aes(Variable, Varimp)) +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = Variable, yend = 0), col = "black") +
geom_point(col = "black") +
coord_flip() +
# facet_wrap(~Variety, ncol = 3) +
theme_cup() +
labs(x = "", y = "", title = names(datives_variety_varimps)[i])
}
) |>
wrap_plots(ncol = 3) +
plot_annotation(caption = "Click to enlarge")
Variety interactions in mixed-effects regression model
Next fit a generalized mixed-effects regression model to the data.
Set reference levels.
data_datives$Response <- relevel(data_datives$Response, ref = "do")
data_datives$ThemeBinComplexity <- relevel(data_datives$ThemeBinComplexity, ref = "simple") # Unlike in Mel's PhD
data_datives$RecPron <- relevel(data_datives$RecPron, ref = "non-pron") # unlike in Mel's PhD
data_datives$ThemePron <- relevel(data_datives$ThemePron, ref = "non-pron") # as with RecPron
Formula for datives.
f_reg_dats <- Response ~
logWeightRatio +
RecPron +
ThemeDefiniteness +
ThemeHeadFreq +
ThemePron +
variety_of_E +
logWeightRatio:variety_of_E + # According to Melanie's PhD, interacts with Variety
RecPron:variety_of_E + # According to Melanie's PhD, interacts with Variety
(1 | FileID_pruned) +
(1 | Verb_pruned)
Fit the mixed-effects logistic regression model.
glmer_ctrl <- glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
if (!file.exists(here("model_output", "dat_glmer1.rds"))) {
glmer_datives1 <- glmer(f_reg_dats, data_datives, family = binomial, control = glmer_ctrl)
saveRDS(glmer_datives1, here("model_output", "dat_glmer1.rds"))
} else {
glmer_datives1 <- readRDS(here("model_output", "dat_glmer1.rds"))
}
Performance measures of the model.
model_performance(glmer_datives1)
Random effect variances not available. Returned R2 does not account for random effects.
# Indices of model performance
AIC | AICc | BIC | R2 (cond.) | R2 (marg.) | RMSE | Sigma | Log_loss | Score_log | Score_spherical
-----------------------------------------------------------------------------------------------------------------
7164.061 | 7164.221 | 7403.605 | | 0.683 | 0.284 | 1.000 | 0.263 | -Inf | 7.085e-04
Precision and recall:
data_datives$fitted_glmer1 <- fitted(glmer_datives1)
data_datives$predicted_glmer1 <- as.factor(ifelse(data_datives$fitted_glmer1 >= .5, "pd", "do"))
caret::confusionMatrix(data_datives$predicted_glmer1, data_datives$Response,
mode = "prec_recall"
)
Confusion Matrix and Statistics
Reference
Prediction do pd
do 8348 840
pd 639 3344
Accuracy : 0.8877
95% CI : (0.8822, 0.8931)
No Information Rate : 0.6823
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7376
Mcnemar's Test P-Value : 1.987e-07
Precision : 0.9086
Recall : 0.9289
F1 : 0.9186
Prevalence : 0.6823
Detection Rate : 0.6338
Detection Prevalence : 0.6976
Balanced Accuracy : 0.8641
'Positive' Class : do
Area under ROC curve:
data.frame(
obs = data_datives$Response,
pred = data_datives$predicted_glmer1,
do = 1 - data_datives$fitted_glmer1,
pd = data_datives$fitted_glmer1
) |>
caret::twoClassSummary(lev = levels(data_datives$Response))
      ROC      Sens      Spec
0.9501111 0.9288973 0.7992352
Check for multicollinearity.
kappa_mer(glmer_datives1)
[1] 41.94774
check_collinearity(glmer_datives1)
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI Increased SE Tolerance
ThemeDefiniteness 1.15 [ 1.13, 1.17] 1.07 0.87
ThemeHeadFreq 1.65 [ 1.61, 1.69] 1.28 0.61
ThemePron 1.45 [ 1.42, 1.48] 1.20 0.69
Tolerance 95% CI
[0.85, 0.89]
[0.59, 0.62]
[0.67, 0.70]
High Correlation
Term VIF VIF 95% CI Increased SE Tolerance
logWeightRatio 11.17 [ 10.81, 11.54] 3.34 0.09
RecPron 10.75 [ 10.41, 11.11] 3.28 0.09
variety_of_E 39.09 [ 37.80, 40.43] 6.25 0.03
logWeightRatio:variety_of_E 222.30 [214.87, 229.99] 14.91 4.50e-03
RecPron:variety_of_E 811.50 [784.34, 839.61] 28.49 1.23e-03
Tolerance 95% CI
[0.09, 0.09]
[0.09, 0.10]
[0.02, 0.03]
[0.00, 0.00]
[0.00, 0.00]
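For readers unfamiliar with these diagnostics, a hedged sketch on simulated data (not the book's model) of what the VIF and kappa figures measure: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the others, and kappa is the condition number of the model matrix.

```r
# Simulate three predictors, two of them strongly correlated
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly correlated with x1
x3 <- rnorm(n)                                # (near-)independent
X  <- cbind(x1, x2, x3)

# VIF by hand: regress each predictor on the others
vif <- sapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})

# Condition number of the scaled model matrix, cf. the kappa_mer() helper
cond <- kappa(scale(X), exact = TRUE)
```

Here `vif[1]` and `vif[2]` are inflated (around 5 for a 0.9 correlation) while `vif[3]` stays near 1. Note that high VIFs for interaction terms, as in the output above, are expected and largely harmless.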
Check for overdispersion.
performance::check_overdispersion(glmer_datives1)
# Overdispersion test
dispersion ratio = 0.914
Pearson's Chi-Squared = 12010.225
p-value = 1
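The dispersion ratio reported above is the sum of squared Pearson residuals divided by the residual degrees of freedom. A hedged sketch on a plain `glm` fitted to simulated binary data (not the book's mixed model):

```r
# Simulate binary data from a known logistic model and refit it
set.seed(7)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
m <- glm(y ~ x, family = binomial)

# Dispersion ratio: Pearson chi-squared over residual df; values near 1
# indicate no overdispersion
pearson_chisq <- sum(residuals(m, type = "pearson")^2)
dispersion    <- pearson_chisq / df.residual(m)

# One-sided p-value against a chi-squared reference distribution
p_value <- pchisq(pearson_chisq, df = df.residual(m), lower.tail = FALSE)
```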
Create Table 5.4 of regression model effects in the dative alternation.
model_parameters(glmer_datives1, effects = "fixed", verbose = FALSE) |>
filter(p < 0.05) |>
as.data.frame() |>
select(c(1:3, 7, 9)) |>
kable(digits = 3) |>
kable_styling()
| Parameter | Coefficient | SE | z | p |
|---|---|---|---|---|
| (Intercept) | 1.255 | 0.365 | 3.442 | 0.001 |
| logWeightRatio | 1.491 | 0.127 | 11.717 | 0.000 |
| RecPronpron | -1.751 | 0.231 | -7.570 | 0.000 |
| ThemeDefinitenessdef | 0.656 | 0.073 | 8.987 | 0.000 |
| ThemeHeadFreq | 0.000 | 0.000 | 5.211 | 0.000 |
| ThemePronpron | 1.227 | 0.145 | 8.478 | 0.000 |
| variety_of_EHKE | 0.557 | 0.161 | 3.473 | 0.001 |
| variety_of_EIndE | 1.222 | 0.167 | 7.311 | 0.000 |
| variety_of_EPhlE | 0.497 | 0.167 | 2.970 | 0.003 |
| logWeightRatio:variety_of_EIndE | -0.338 | 0.169 | -1.993 | 0.046 |
| RecPronpron:variety_of_EIndE | -0.934 | 0.307 | -3.039 | 0.002 |
Create Figure 5.7
Code
dat_var_anim_partial_eff <- ggeffect(glmer_datives1, c("variety_of_E", "RecPron")) |>
as.data.frame() |>
mutate(
variety_of_E = factor(x, levels = var_levs)
)
dat_var_anim_partial_eff |>
ggplot(aes(variety_of_E, predicted, shape = group)) +
geom_vline(aes(xintercept = variety_of_E), color = "gray", linetype = "dashed") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = .2, position = position_dodge(width = .4)
) +
annotate("rect",
xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
alpha = 0.2, fill = c("gray")
) +
geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
labs(y = "Predicted probability of prepositional dative", x = "Variety", title = "") +
scale_shape_manual(name = "RecPron", values = c(16, 21)) +
theme_cup() +
theme( # plot.title = element_text(size=20, face="bold", vjust=2),
plot.title = element_blank(),
axis.title.x = element_text(size = 14, vjust = -0.5),
axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
axis.title.y = element_text(size = 14, vjust = 1.5),
legend.position = "top"
) +
scale_y_continuous(label = scales::percent)
5.4 Particle placement alternation
Cross tabulate response and variety of English.
table(data_particle_verbs$variety_of_E, data_particle_verbs$Response) |>
make_prop_table() |>
kable() |>
kable_styling()
| | Continuous | % | Split | % | Total |
|---|---|---|---|---|---|
| BrE | 939 | 70.9 | 385 | 29.1 | 1324 |
| CanE | 964 | 69.3 | 428 | 30.7 | 1392 |
| IrE | 989 | 73.4 | 358 | 26.6 | 1347 |
| NZE | 1059 | 68.6 | 484 | 31.4 | 1543 |
| HKE | 1028 | 85.1 | 180 | 14.9 | 1208 |
| IndE | 977 | 90.2 | 106 | 9.8 | 1083 |
| JamE | 944 | 85.4 | 162 | 14.6 | 1106 |
| PhlE | 868 | 86.5 | 136 | 13.5 | 1004 |
| SgE | 1130 | 84.8 | 203 | 15.2 | 1333 |
| Total | 8898 | NA | 2442 | NA | 11340 |
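`make_prop_table()` is a custom helper defined in the book's setup code. A minimal base-R sketch of what such a function plausibly does (a hypothetical reimplementation, assuming row percentages and marginal totals as in the table above):

```r
# Hypothetical reimplementation of a make_prop_table()-style helper:
# counts, row percentages, and marginal totals from a 2-column table
make_prop_table_sketch <- function(tab) {
  pct <- round(100 * prop.table(tab, margin = 1), 1)  # row percentages
  out <- cbind(tab[, 1], pct[, 1], tab[, 2], pct[, 2], rowSums(tab))
  colnames(out) <- c(colnames(tab)[1], "%", colnames(tab)[2], "%", "Total")
  rbind(out, Total = c(sum(tab[, 1]), NA, sum(tab[, 2]), NA, sum(tab)))
}

# Two rows of the table above as a toy check
toy <- as.table(rbind(BrE = c(Continuous = 939, Split = 385),
                      HKE = c(Continuous = 1028, Split = 180)))
res <- make_prop_table_sketch(toy)
```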
Variable importance in Global random forest model
Now fit conditional random forest to the datasets.
# formula
pv_f1 <- Response ~
DirObjLettLength +
DirObjDefiniteness +
DirObjGivenness +
DirObjConcreteness +
DirObjThematicity +
DirectionalPP +
Semantics +
Surprisal.P
pv_f2 <- update(pv_f1, . ~ . + variety_of_E + Circle + GenreCoarse)
pv_f3 <- update(pv_f1, . ~ . + GenreCoarse)
Run the forest.
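An aside on the `update()` calls used to build `pv_f2` and `pv_f3`: in `update.formula()`, `. ~ .` stands for the existing left- and right-hand sides, so added terms are appended to the base formula. A toy illustration:

```r
# Dot notation in update.formula(): "." expands to the existing sides
f_base <- Response ~ DirObjLettLength + Semantics
f_ext  <- update(f_base, . ~ . + variety_of_E + GenreCoarse)
all.vars(f_ext)
# "Response" "DirObjLettLength" "Semantics" "variety_of_E" "GenreCoarse"
```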
if (!file.exists(here("model_output", "pv_forest1.rds"))) {
pv_forest1 <- party::cforest(pv_f2, data_particle_verbs)
saveRDS(pv_forest1, here("model_output", "pv_forest1.rds"))
} else {
pv_forest1 <- here("model_output", "pv_forest1.rds") |>
readRDS()
}
invisible(gc(verbose = FALSE)) # garbage collect to save RAM
Get predictions.
if (!file.exists(here("model_output", "pv_forest1_preds.rds"))) {
pv_forest1_preds <- unlist(party::treeresponse(pv_forest1))[c(FALSE, TRUE)]
saveRDS(pv_forest1_preds, here("model_output", "pv_forest1_preds.rds"))
} else {
pv_forest1_preds <- here("model_output", "pv_forest1_preds.rds") |>
readRDS()
}
invisible(gc(verbose = FALSE))
Check predictive performance.
data_particle_verbs$fitted_crf1 <- pv_forest1_preds
data_particle_verbs$predicted_crf1 <- as.factor(ifelse(data_particle_verbs$fitted_crf1 >= .5, "Split", "Continuous"))
caret::confusionMatrix(data_particle_verbs$predicted_crf1, data_particle_verbs$Response,
mode = "prec_recall"
)
Confusion Matrix and Statistics
Reference
Prediction Continuous Split
Continuous 8618 1121
Split 280 1321
Accuracy : 0.8765
95% CI : (0.8703, 0.8825)
No Information Rate : 0.7847
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5822
Mcnemar's Test P-Value : < 2.2e-16
Precision : 0.8849
Recall : 0.9685
F1 : 0.9248
Prevalence : 0.7847
Detection Rate : 0.7600
Detection Prevalence : 0.8588
Balanced Accuracy : 0.7547
'Positive' Class : Continuous
Area under ROC curve.
data.frame(
obs = data_particle_verbs$Response,
pred = data_particle_verbs$predicted_crf1,
Continuous = 1 - data_particle_verbs$fitted_crf1,
Split = data_particle_verbs$fitted_crf1
) |>
caret::twoClassSummary(lev = levels(data_particle_verbs$Response))
ROC Sens Spec
0.9389622 0.9685323 0.5409500
Variable importance.
if (!file.exists(here("model_output", "pv_forest1_varimp.rds"))) {
pv_forest1_varimp <- permimp(pv_forest1, conditional = FALSE, progressBar = FALSE, asParty = TRUE)
saveRDS(pv_forest1_varimp, here("model_output", "pv_forest1_varimp.rds"))
} else {
pv_forest1_varimp <- here("model_output", "pv_forest1_varimp.rds") |>
readRDS()
}
Create Figure 5.8
pv_forest1_varimp$values |>
as.data.frame() |>
rename(Varimp = "pv_forest1_varimp$values") |>
rownames_to_column("Variable") |>
ggplot(aes(reorder(Variable, Varimp), Varimp)) +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = Variable, yend = 0), col = "black") +
geom_point(col = "black") +
coord_flip() +
theme_cup() +
labs(x = "", y = "Variable Importance")
Variable importance in by-variety random forest models
Next we fit CRFs to each variety individually.
data_particle_verbs_split <- split(data_particle_verbs, data_particle_verbs$variety_of_E)
Now fit random forest for each variety.
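The split-then-fit pattern used here, sketched on built-in data: `split()` partitions a data frame by a factor, and `lapply()` fits one model (here a plain `lm` rather than a conditional random forest) per subset.

```r
# Partition mtcars by cylinder count, then fit one regression per subset
groups <- split(mtcars, mtcars$cyl)
fits   <- lapply(groups, function(d) lm(mpg ~ wt, data = d))
sapply(fits, function(m) coef(m)[["wt"]])  # one weight slope per cylinder class
```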
if (file.exists(here("model_output", "pv_variety_forests.rds"))) {
pv_variety_forests <- readRDS(
here("model_output", "pv_variety_forests.rds")
)
} else {
pv_variety_forests <- lapply(
data_particle_verbs_split, function(x) fit_party_crf(pv_f3, x)
)
saveRDS(pv_variety_forests, here("model_output", "pv_variety_forests.rds"))
}
Now get the variable importances.
if (file.exists(here("model_output", "pv_variety_varimps.rds"))) {
pv_variety_varimps <- readRDS(
here("model_output", "pv_variety_varimps.rds")
)
} else {
pv_variety_varimps <- lapply(
pv_variety_forests, function(x) get_party_varimp(x)$values
)
saveRDS(pv_variety_varimps, here("model_output", "pv_variety_varimps.rds"))
}
Create Figure 5.9
Code
preds <- pv_variety_varimps[[1]] |>
sort() |>
names()
lapply(
seq_along(pv_variety_varimps),
function(i) {
x <- pv_variety_varimps[[i]]
x |>
as.data.frame() |>
rename(Varimp = 1) |>
rownames_to_column("Variable") |>
mutate(Variable = factor(Variable, levels = preds)) |>
# bind_rows() |>
# mutate(
# Variety = rep(names(genitives_variety_varimps), each = 9)
# ) |>
ggplot(aes(Variable, Varimp)) +
geom_hline(yintercept = 0) +
geom_segment(aes(xend = Variable, yend = 0), col = "black") +
geom_point(col = "black") +
coord_flip() +
# facet_wrap(~Variety, ncol = 3) +
theme_cup() +
labs(x = "", y = "", title = names(pv_variety_varimps)[i])
}
) |>
wrap_plots(ncol = 3) +
plot_annotation(caption = "Click to enlarge")
Variety interactions in mixed effects regression model
Next fit a generalized mixed-effects regression model to the data.
data_particle_verbs$Response <- relevel(data_particle_verbs$Response, ref = "Continuous") # predicted odds are for the split pattern
data_particle_verbs$Semantics <- relevel(data_particle_verbs$Semantics, ref = "non-compositional")
data_particle_verbs$DirectionalPP <- relevel(data_particle_verbs$DirectionalPP, ref = "no")
data_particle_verbs$Circle <- relevel(data_particle_verbs$Circle, ref = "Inner")
Formula.
f_reg_pv <- Response ~
Surprisal.P +
DirObjLettLength +
Semantics +
DirectionalPP +
DirObjThematicity +
Circle +
DirectionalPP:Circle +
Semantics:Circle +
(1 | variety_of_E) +
(1 | FileID_pruned) +
(1 | VerbPart_pruned)
Fit mixed-effects logistic regression model:
glmer_ctrl <- glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
if (!file.exists(here("model_output", "pv_glmer1.rds"))) {
glmer_particle_verbs1 <- glmer(f_reg_pv, data_particle_verbs, family = binomial, control = glmer_ctrl)
saveRDS(glmer_particle_verbs1, here("model_output", "pv_glmer1.rds"))
} else {
glmer_particle_verbs1 <- readRDS(here("model_output", "pv_glmer1.rds"))
}
Performance measures of model.
model_performance(glmer_particle_verbs1)
# Indices of model performance
AIC | AICc | BIC | R2 (cond.) | R2 (marg.) | ICC | RMSE | Sigma | Log_loss | Score_log | Score_spherical
-------------------------------------------------------------------------------------------------------------------------
8055.950 | 8055.977 | 8143.983 | 0.598 | 0.302 | 0.424 | 0.329 | 1.000 | 0.340 | -Inf | 9.466e-04
Precision and recall.
data_particle_verbs$fitted_glmer1 <- fitted(glmer_particle_verbs1)
data_particle_verbs$predicted_glmer1 <- as.factor(ifelse(data_particle_verbs$fitted_glmer1 >= .5, "Split", "Continuous"))
caret::confusionMatrix(data_particle_verbs$predicted_glmer1, data_particle_verbs$Response,
mode = "prec_recall"
)
Confusion Matrix and Statistics
Reference
Prediction Continuous Split
Continuous 8431 1275
Split 467 1167
Accuracy : 0.8464
95% CI : (0.8396, 0.853)
No Information Rate : 0.7847
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4834
Mcnemar's Test P-Value : < 2.2e-16
Precision : 0.8686
Recall : 0.9475
F1 : 0.9064
Prevalence : 0.7847
Detection Rate : 0.7435
Detection Prevalence : 0.8559
Balanced Accuracy : 0.7127
'Positive' Class : Continuous
Area under ROC curve.
data.frame(
obs = data_particle_verbs$Response,
pred = data_particle_verbs$predicted_glmer1,
Continuous = 1 - data_particle_verbs$fitted_glmer1,
Split = data_particle_verbs$fitted_glmer1
) |>
caret::twoClassSummary(lev = levels(data_particle_verbs$Response))
ROC Sens Spec
0.8783687 0.9475163 0.4778870
Check multicollinearity
kappa_mer(glmer_particle_verbs1)
[1] 7.148104
performance::check_collinearity(glmer_particle_verbs1)
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
Surprisal.P 1.01 [1.00, 1.18] 1.00 0.99 [0.85, 1.00]
DirObjLettLength 1.02 [1.01, 1.05] 1.01 0.98 [0.95, 0.99]
Semantics 1.59 [1.55, 1.63] 1.26 0.63 [0.61, 0.64]
DirectionalPP 1.82 [1.77, 1.87] 1.35 0.55 [0.54, 0.57]
DirObjThematicity 1.02 [1.00, 1.05] 1.01 0.98 [0.95, 1.00]
Circle 1.21 [1.18, 1.24] 1.10 0.83 [0.81, 0.84]
DirectionalPP:Circle 1.85 [1.81, 1.90] 1.36 0.54 [0.52, 0.55]
Semantics:Circle 1.75 [1.71, 1.80] 1.32 0.57 [0.56, 0.58]
Check overdispersion.
performance::check_overdispersion(glmer_particle_verbs1)
# Overdispersion test
dispersion ratio = 0.821
Pearson's Chi-Squared = 9300.361
p-value = 1
Create Table 5.6 of regression model effects in the particle placement alternation.
model_parameters(glmer_particle_verbs1, effects = "fixed", verbose = FALSE) |>
filter(Parameter == "(Intercept)" | p < 0.05) |>
as.data.frame() |>
select(c(1:3, 7, 9)) |>
kable(digits = 3) |>
kable_styling()
| Parameter | Coefficient | SE | z | p |
|---|---|---|---|---|
| (Intercept) | -0.944 | 0.604 | -1.564 | 0.118 |
| Surprisal.P | 0.257 | 0.021 | 12.548 | 0.000 |
| DirObjLettLength | -0.142 | 0.006 | -22.578 | 0.000 |
| Semanticscompositional | 0.930 | 0.080 | 11.602 | 0.000 |
| DirectionalPPyes | 1.493 | 0.122 | 12.236 | 0.000 |
| DirObjThematicity | 0.034 | 0.005 | 7.037 | 0.000 |
| CircleOuter | -0.801 | 0.132 | -6.085 | 0.000 |
| DirectionalPPyes:CircleOuter | -0.439 | 0.174 | -2.520 | 0.012 |
| Semanticscompositional:CircleOuter | -0.319 | 0.119 | -2.681 | 0.007 |
Create Figure 5.10, the barplot of by-variety intercept adjustments.
nms <- rownames(ranef(glmer_particle_verbs1)$variety_of_E)
intercepts <- ranef(glmer_particle_verbs1)$variety_of_E[, 1]
support <- tapply(data_particle_verbs$variety_of_E, data_particle_verbs$variety_of_E, length)
labels <- paste(nms)
barplot(intercepts[order(intercepts)], names.arg = labels[order(intercepts)], horiz = FALSE, las = 2, cex.names = 0.8, family = "serif")
Create Figure 5.11, the partial effects plot for the DirectionalPP:Circle interaction.
Code
pv_var_dirPP_partial_eff <- effects::Effect(c("Circle", "DirectionalPP"), glmer_particle_verbs1) |>
as.data.frame()
pv_var_dirPP_partial_eff |>
ggplot(aes(Circle, fit, shape = DirectionalPP)) +
geom_errorbar(aes(ymin = lower, ymax = upper),
width = .2, position = position_dodge(width = .4)
) +
annotate("rect",
xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
alpha = 0.2, fill = c("gray")
) +
geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
labs(y = "Predicted probability of split variant", x = "Circle", title = "") +
# scale_shape_manual(name = "PorAnimacyBin", values = c(21, 16)) +
theme_cup() +
theme( # plot.title = element_text(size=20, face="bold", vjust=2),
plot.title = element_blank(),
axis.title.x = element_text(size = 14, vjust = -0.5),
axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
axis.title.y = element_text(size = 14, vjust = 1.5),
legend.position = "top"
) +
scale_y_continuous(limits = c(0, .9), breaks = seq(0, .9, .1), label = scales::percent)
CIRCLE with DIRECTIONALPP in the particle placement model. Predicted probabilities (vertical axis, expressed in percent) are for the split variant. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. The Inner Circle (shaded) serves as the reference category in the underlying regression model.
Create Figure 5.12, the partial effects plot for the Circle:Semantics interaction.
Code
pv_var_semantics_partial_eff <- effects::Effect(c("Circle", "Semantics"), glmer_particle_verbs1) |>
as.data.frame() |>
mutate(
Semantics = factor(Semantics, levels = c("compositional", "non-compositional"))
)
pv_var_semantics_partial_eff |>
ggplot(aes(Circle, fit, shape = Semantics)) +
geom_errorbar(aes(ymin = lower, ymax = upper),
width = .2, position = position_dodge(width = .4)
) +
annotate("rect",
xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
alpha = 0.2, fill = c("gray")
) +
geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
labs(y = "Predicted probability of split variant", x = "Circle", title = "") +
# scale_shape_manual(name = "PorAnimacyBin", values = c(21, 16)) +
theme_cup() +
theme( # plot.title = element_text(size=20, face="bold", vjust=2),
plot.title = element_blank(),
axis.title.x = element_text(size = 14, vjust = -0.5),
axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
axis.title.y = element_text(size = 14, vjust = 1.5),
legend.position = "top"
) +
scale_y_continuous(limits = c(0, .9), breaks = seq(0, .9, .1), label = scales::percent)
CIRCLE with SEMANTICS in the particle placement model. Predicted probabilities (vertical axis, expressed in percent) are for the split variant. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. The Inner Circle (shaded) serves as the reference category in the underlying regression model.
Interaction of binned length in mixed-effects regression model
Fit model with binned length.
data_particle_verbs <- data_particle_verbs |>
mutate(
DirObjLettLength_bin = ifelse(DirObjLettLength <= 11, "short", "long") |>
as.factor()
)
data_particle_verbs$DirObjLettLength_bin <- relevel(data_particle_verbs$DirObjLettLength_bin, ref = "short")
New model.
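An aside on the binning step above: `ifelse()` plus `as.factor()` works, but base R's `cut()` does the same in one step and sets the level order explicitly, making the `relevel()` call unnecessary.

```r
# cut() with right-closed intervals reproduces the "<= 11 is short" rule
len <- c(3, 11, 12, 25)
bin <- cut(len, breaks = c(-Inf, 11, Inf), labels = c("short", "long"))
as.character(bin)
# "short" "short" "long" "long"
```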
f_reg_pv2 <- Response ~
Surprisal.P +
DirObjLettLength_bin +
Semantics +
DirectionalPP +
DirObjThematicity +
Circle +
DirObjLettLength_bin:Circle +
DirectionalPP:Circle +
Semantics:Circle +
(1 | variety_of_E) +
(1 | FileID_pruned) +
(1 | VerbPart_pruned)
if (!file.exists(here("model_output", "pv_glmer2.rds"))) {
glmer_particle_verbs2 <- glmer(f_reg_pv2, data_particle_verbs, family = binomial, control = glmer_ctrl)
saveRDS(glmer_particle_verbs2, here("model_output", "pv_glmer2.rds"))
} else {
glmer_particle_verbs2 <- readRDS(here("model_output", "pv_glmer2.rds"))
}
Performance measures of model.
model_performance(glmer_particle_verbs2)
# Indices of model performance
AIC | AICc | BIC | R2 (cond.) | R2 (marg.) | ICC | RMSE | Sigma | Log_loss | Score_log | Score_spherical
-------------------------------------------------------------------------------------------------------------------------
8143.988 | 8144.020 | 8239.357 | 0.571 | 0.242 | 0.433 | 0.330 | 1.000 | 0.344 | -Inf | 6.577e-04
Precision and recall.
data_particle_verbs$fitted_glmer2 <- fitted(glmer_particle_verbs2)
data_particle_verbs$predicted_glmer2 <- as.factor(ifelse(data_particle_verbs$fitted_glmer2 >= .5, "Split", "Continuous"))
caret::confusionMatrix(data_particle_verbs$predicted_glmer2, data_particle_verbs$Response,
mode = "prec_recall"
)
Confusion Matrix and Statistics
Reference
Prediction Continuous Split
Continuous 8431 1310
Split 467 1132
Accuracy : 0.8433
95% CI : (0.8365, 0.8499)
No Information Rate : 0.7847
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4699
Mcnemar's Test P-Value : < 2.2e-16
Precision : 0.8655
Recall : 0.9475
F1 : 0.9047
Prevalence : 0.7847
Detection Rate : 0.7435
Detection Prevalence : 0.8590
Balanced Accuracy : 0.7055
'Positive' Class : Continuous
Area under ROC curve.
data.frame(
obs = data_particle_verbs$Response,
pred = data_particle_verbs$predicted_glmer2,
Continuous = 1 - data_particle_verbs$fitted_glmer2,
Split = data_particle_verbs$fitted_glmer2
) |>
caret::twoClassSummary(lev = levels(data_particle_verbs$Response))
ROC Sens Spec
0.8750134 0.9475163 0.4635545
Check multicollinearity
kappa_mer(glmer_particle_verbs2)
[1] 8.440186
performance::check_collinearity(glmer_particle_verbs2)
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI Increased SE Tolerance
Surprisal.P 1.01 [1.00, 1.20] 1.00 0.99
DirObjLettLength_bin 1.61 [1.57, 1.65] 1.27 0.62
Semantics 1.57 [1.54, 1.62] 1.25 0.64
DirectionalPP 1.79 [1.75, 1.84] 1.34 0.56
DirObjThematicity 1.01 [1.00, 1.06] 1.01 0.99
Circle 1.28 [1.25, 1.31] 1.13 0.78
DirObjLettLength_bin:Circle 1.67 [1.62, 1.71] 1.29 0.60
DirectionalPP:Circle 1.83 [1.78, 1.88] 1.35 0.55
Semantics:Circle 1.73 [1.68, 1.77] 1.31 0.58
Tolerance 95% CI
[0.84, 1.00]
[0.60, 0.64]
[0.62, 0.65]
[0.54, 0.57]
[0.94, 1.00]
[0.76, 0.80]
[0.58, 0.62]
[0.53, 0.56]
[0.56, 0.59]
Check overdispersion.
performance::check_overdispersion(glmer_particle_verbs2)
# Overdispersion test
dispersion ratio = 0.848
Pearson's Chi-Squared = 9599.862
p-value = 1
Create Figure 5.13, the partial effects plot for Circle x DirObj length interaction.
Code
pv_var_length_partial_eff <- effects::Effect(c("Circle", "DirObjLettLength_bin"), glmer_particle_verbs2) |>
as.data.frame() |>
mutate(
DirObjLettLength_bin = factor(DirObjLettLength_bin, levels = c("long", "short"))
)
pv_var_length_partial_eff |>
ggplot(aes(Circle, fit, shape = DirObjLettLength_bin)) +
geom_errorbar(aes(ymin = lower, ymax = upper),
width = .2, position = position_dodge(width = .4)
) +
annotate("rect",
xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
alpha = 0.2, fill = c("gray")
) +
geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
labs(y = "Predicted probability of split variant", x = "Circle", title = "") +
# scale_shape_manual(name = "PorAnimacyBin", values = c(21, 16)) +
theme_cup() +
theme( # plot.title = element_text(size=20, face="bold", vjust=2),
plot.title = element_blank(),
axis.title.x = element_text(size = 14, vjust = -0.5),
axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
axis.title.y = element_text(size = 14, vjust = 1.5),
legend.position = "top"
) +
scale_y_continuous(limits = c(0, .9), breaks = seq(0, .9, .1), label = scales::percent)
Circle with DirObjLettLength_bin in the particle placement model. Predicted probabilities (vertical axis, expressed in percent) are for the split variant. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. The Inner Circle (shaded) serves as the reference category in the underlying regression model.
Remove large model objects that we no longer need, to clean up the workspace (optional).
rm(
gen_forest1, genitives_variety_forests, glmer_genitives1,
dat_forest1, datives_variety_forests, glmer_datives1,
pv_forest1, pv_variety_forests, glmer_particle_verbs1, glmer_particle_verbs2
)
invisible(gc(verbose = FALSE)) # free unused memory
Chapter 6
In this chapter we run the Variation-Based Distance & Similarity Modeling (VADIS) analyses of the three alternations. Note that because line 3 variable importance scores are based on random forest models, results will vary from run to run. We recommend tuning models before running, and setting random seeds for exact reproducibility (not done here).
6.4 Quantification of Similarity Coefficients
The code here shows only the analysis for the full datasets. See @sec-append-6.4 in the Appendix for the code for the data subsets (spoken/written only, Inner/Outer Circle only).
All available data
Start with the GENITIVE alternation. First run the by-variety regression models.
# Add random effects
gen_f4 <- update(gen_f1, . ~ . + (1 | PorHeadLemma_pruned) + (1 | GenreCoarse))
if (!file.exists(here("model_output", "gen_glmer_list.rds"))) {
genitives_variety_glmer <- vector("list")
for (i in seq_along(data_genitives_split)) {
d <- data_genitives_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = gen_f4) # use the fitting function for convenience
# fit the model
genitives_variety_glmer[[i]] <- glmer(gen_f4,
data = d_std, family = binomial, # set optimizer controls to help convergence
control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
)
rm(d, d_std) # remove datasets
}
names(genitives_variety_glmer) <- names(data_genitives_split)
saveRDS(genitives_variety_glmer, here("model_output", "gen_glmer_list.rds"))
} else {
genitives_variety_glmer <- readRDS(here("model_output", "gen_glmer_list.rds"))
}
Check models.
round(VADIS::summary_stats(genitives_variety_glmer), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 1612 0.729 0.854 0.102 0.910 0.328 1163.020 1.205 1.958
CanE 1360 0.616 0.854 0.108 0.920 0.346 1018.358 1.228 2.243
IrE 1313 0.703 0.843 0.111 0.903 0.350 1001.838 1.149 1.978
NZE 1867 0.694 0.849 0.107 0.905 0.347 1418.148 1.109 1.940
HKE 1529 0.680 0.853 0.106 0.917 0.336 1135.965 1.123 2.128
IndE 1836 0.752 0.874 0.091 0.921 0.297 1212.166 1.133 1.923
JamE 1307 0.772 0.876 0.088 0.919 0.286 847.436 1.212 1.981
PhlE 1749 0.719 0.851 0.110 0.899 0.350 1323.141 1.092 1.912
SgE 1225 0.656 0.849 0.112 0.910 0.361 991.788 1.225 2.365
HosLem.p
BrE 0.014
CanE 0.550
IrE 0.527
NZE 0.031
HKE 0.767
IndE 0.565
JamE 0.564
PhlE 0.594
SgE 0.004
Now fit random forest for each variety.
if (file.exists(here("model_output", "gen_crf_vadis_list.rds"))) {
genitives_variety_forests_vadis <- readRDS(
here("model_output", "gen_crf_vadis_list.rds")
)
} else {
genitives_variety_forests_vadis <- lapply(
data_genitives_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(genitives_variety_forests_vadis, here("model_output", "gen_crf_vadis_list.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
gen_signif_line <- vadis_line1(genitives_variety_glmer,
path = here("model_output", "gens_line1.rds"),
overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer,
path = here("model_output", "gens_line2.rds"),
overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis,
path = here("model_output", "gens_line3.rds"),
conditional = FALSE, overwrite = "reload"
)
Now look at the mean values by line and variety.
gen_mean_sims <- data.frame(
line1 = gen_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = gen_coef_line$similarity.scores[, 1],
line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_genitives_split), "Line Mean")
round(gen_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.931 0.675 0.845 0.817
CanE 0.889 0.713 0.890 0.831
IrE 0.931 0.752 0.848 0.844
NZE 0.889 0.727 0.890 0.835
HKE 0.861 0.679 0.705 0.749
IndE 0.861 0.619 0.780 0.753
JamE 0.931 0.703 0.786 0.806
PhlE 0.889 0.623 0.810 0.774
SgE 0.931 0.727 0.845 0.834
Line Mean 0.901 0.691 0.822 0.805
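The `add_row(!!!colMeans(.))` idiom used above splices the column means into the data frame as an extra row. A base-R sketch of the same operation on toy values:

```r
# Toy similarity scores for two varieties across three VADIS lines
sims <- data.frame(line1 = c(0.93, 0.89),
                   line2 = c(0.68, 0.71),
                   line3 = c(0.85, 0.89))
sims <- rbind(sims, colMeans(sims))   # append the per-line means as a row
sims$Variety_Mean <- rowMeans(sims)   # per-variety mean across the three lines
rownames(sims) <- c("BrE", "CanE", "Line Mean")
```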
Next the DATIVE alternation.
# Add random effects
dat_f4 <- update(dat_f1, . ~ . + (1 | Verb_pruned) + (1 | GenreCoarse))
if (!file.exists(here("model_output", "dat_glmer_list.rds"))) {
datives_variety_glmer <- vector("list")
for (i in seq_along(data_datives_split)) {
d <- data_datives_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = dat_f4) # use the fitting function for convenience
# fit the model
datives_variety_glmer[[i]] <- glmer(dat_f4,
data = d_std, family = binomial, # set optimizer controls to help convergence
control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
)
rm(d, d_std) # remove datasets
}
names(datives_variety_glmer) <- names(data_datives_split)
saveRDS(datives_variety_glmer, here("model_output", "dat_glmer_list.rds"))
} else {
datives_variety_glmer <- readRDS(here("model_output", "dat_glmer_list.rds"))
}
Check models.
round(VADIS::summary_stats(datives_variety_glmer), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 1318 0.707 0.891 0.077 0.949 0.257 775.557 2.538 4.009
CanE 1357 0.713 0.900 0.072 0.960 0.230 725.754 2.089 3.759
IrE 1270 0.728 0.896 0.074 0.953 0.244 724.136 2.081 4.025
NZE 1485 0.711 0.894 0.080 0.946 0.263 893.623 2.228 3.747
HKE 1768 0.639 0.890 0.081 0.952 0.265 1064.279 2.003 3.929
IndE 1550 0.588 0.898 0.076 0.956 0.259 932.243 2.017 4.241
JamE 1375 0.711 0.925 0.059 0.969 0.202 665.191 2.412 4.107
PhlE 1505 0.661 0.909 0.069 0.963 0.229 794.234 2.154 4.134
SgE 1543 0.709 0.888 0.079 0.951 0.256 888.064 2.447 4.001
HosLem.p
BrE 0.482
CanE 0.828
IrE 0.308
NZE 0.082
HKE 0.239
IndE 0.002
JamE 0.327
PhlE 0.927
SgE 0.374
if (file.exists(here("model_output", "dat_crf_vadis_list.rds"))) {
datives_variety_forests_vadis <- readRDS(
here("model_output", "dat_crf_vadis_list.rds")
)
} else {
datives_variety_forests_vadis <- lapply(
data_datives_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(datives_variety_forests_vadis, here("model_output", "dat_crf_vadis_list.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
dat_signif_line <- vadis_line1(datives_variety_glmer,
path = here("model_output", "dat_line1.rds"),
overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer,
path = here("model_output", "dat_line2.rds"),
overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis,
path = here("model_output", "dat_line3.rds"),
conditional = FALSE, overwrite = "reload"
)
Now look at the mean values by line and variety.
dat_mean_sims <- data.frame(
line1 = dat_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = dat_coef_line$similarity.scores[, 1],
line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_datives_split), "Line Mean")
round(dat_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.703 0.738 0.732 0.724
CanE 0.625 0.750 0.685 0.687
IrE 0.656 0.711 0.688 0.685
NZE 0.781 0.774 0.723 0.760
HKE 0.641 0.744 0.771 0.719
IndE 0.719 0.720 0.720 0.720
JamE 0.688 0.553 0.842 0.694
PhlE 0.656 0.712 0.732 0.700
SgE 0.781 0.735 0.798 0.771
Line Mean 0.694 0.715 0.743 0.718
Finally, the PARTICLE PLACEMENT alternation.
# Add random effects
# f2 <- update(f1, .~. + (1|FileID) + (1|VerbPart_pruned) + (1|DirObjHeadPlain_pruned) + (1|GenreCoarse))
pv_f4 <- update(pv_f1, . ~ . + (1 | VerbPart_pruned) + (1 | Genre))
if (!file.exists(here("model_output", "pv_glmer_list.rds"))) {
pv_variety_glmer <- vector("list")
for (i in seq_along(data_particle_verbs_split)) {
d <- data_particle_verbs_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = pv_f4) # use the fitting function for convenience
# fit the model
pv_variety_glmer[[i]] <- glmer(pv_f4,
data = d_std, family = binomial, # set optimizer controls to help convergence
control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
)
rm(d, d_std) # remove datasets
}
names(pv_variety_glmer) <- names(data_particle_verbs_split)
saveRDS(pv_variety_glmer, here("model_output", "pv_glmer_list.rds"))
} else {
pv_variety_glmer <- readRDS(here("model_output", "pv_glmer_list.rds"))
}
Check models.
round(VADIS::summary_stats(pv_variety_glmer), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 1324 0.709 0.832 0.117 0.893 0.365 1118.141 1.164 1.699
CanE 1392 0.693 0.830 0.114 0.905 0.354 1161.421 1.148 1.821
IrE 1347 0.734 0.834 0.114 0.891 0.358 1125.746 1.167 1.714
NZE 1543 0.686 0.854 0.108 0.912 0.346 1267.649 1.133 1.666
HKE 1208 0.851 0.882 0.082 0.898 0.265 761.069 1.180 1.767
IndE 1083 0.902 0.922 0.064 0.877 0.219 543.611 1.260 1.928
JamE 1106 0.854 0.892 0.083 0.880 0.276 709.759 1.253 1.726
PhlE 1004 0.865 0.888 0.083 0.885 0.265 602.609 1.215 1.686
SgE 1333 0.848 0.893 0.081 0.908 0.260 828.958 1.201 1.689
HosLem.p
BrE 0.406
CanE 0.051
IrE 0.021
NZE 0.196
HKE 0.125
IndE 0.549
JamE 0.181
PhlE 0.855
SgE 0.042
if (file.exists(here("model_output", "pv_crf_vadis_list.rds"))) {
pv_variety_forests_vadis <- readRDS(
here("model_output", "pv_crf_vadis_list.rds")
)
} else {
pv_variety_forests_vadis <- lapply(
data_particle_verbs_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(pv_variety_forests_vadis, here("model_output", "pv_crf_vadis_list.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
pv_signif_line <- vadis_line1(pv_variety_glmer,
path = here("model_output", "pv_line1.rds"),
overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer,
path = here("model_output", "pv_line2.rds"),
overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis,
path = here("model_output", "pv_line3.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
pv_mean_sims <- data.frame(
line1 = pv_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = pv_coef_line$similarity.scores[, 1],
line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_particle_verbs_split), "Line Mean")
round(pv_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.719 0.798 0.780 0.765
CanE 0.766 0.759 0.744 0.756
IrE 0.719 0.806 0.804 0.776
NZE 0.766 0.772 0.798 0.779
HKE 0.797 0.792 0.756 0.781
IndE 0.609 0.664 0.613 0.629
JamE 0.734 0.815 0.708 0.753
PhlE 0.812 0.740 0.631 0.728
SgE 0.766 0.819 0.774 0.786
Line Mean 0.743 0.774 0.734 0.750
Now create Table 6.5.
tab <- data.frame(
Genitives = unlist(gen_mean_sims[10, ]),
Datives = unlist(dat_mean_sims[10, ]),
`Particle placement` = unlist(pv_mean_sims[10, ])
)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
kable(tab, digits = 2) |>
  kable_styling()
|  | Genitives | Datives | Particle.placement |
|---|---|---|---|
| 1st line | 0.90 | 0.69 | 0.74 |
| 2nd line | 0.69 | 0.72 | 0.77 |
| 3rd line | 0.82 | 0.74 | 0.73 |
| mean | 0.80 | 0.72 | 0.75 |
Core Grammar Score:
round(rowMeans(tab[4, ]), 3)
 mean
0.758
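The core grammar score Γ is simply the grand mean of the three per-alternation means in the bottom row of Table 6.5. A quick Python check, using the rounded table values (which is why it lands a fraction below the 0.758 computed from the unrounded means):

```python
# Per-alternation means from the bottom row of Table 6.5 (rounded)
line_means = {"genitives": 0.80, "datives": 0.72, "particles": 0.75}

gamma = sum(line_means.values()) / len(line_means)
print(round(gamma, 3))  # 0.757, vs. 0.758 from the unrounded means
```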
For simplicity, we’ve recreated Table 6.6 here. Again, all code for the analyses is in Section 6.1.
|  | Core grammar score (Γ) | Hierarchy of stability |
|---|---|---|
| All data | 0.76 | genitives > particles > datives |
| Spoken only | 0.62 | particles > datives > genitives |
| Written only | 0.75 | genitives > datives > particles |
| Inner Circle only | 0.79 | genitives > particles > datives |
| Outer Circle only | 0.73 | genitives > datives > particles |
6.5 Evaluating Coherence
Figure 6.1 is shown here. Again, the numbers will vary slightly from run to run because of the randomness inherent in random forests.
gen_varimp_line$distance.matrix
           BrE       CanE        IrE        NZE        HKE       IndE
CanE 0.09523810
IrE 0.09523810 0.07142857
NZE 0.09523810 0.00000000 0.07142857
HKE 0.42857143 0.21428571 0.30952381 0.21428571
IndE 0.04761905 0.16666667 0.23809524 0.16666667 0.47619048
JamE 0.23809524 0.11904762 0.04761905 0.11904762 0.23809524 0.38095238
PhlE 0.14285714 0.16666667 0.19047619 0.16666667 0.19047619 0.19047619
SgE 0.09523810 0.04761905 0.19047619 0.04761905 0.28571429 0.09523810
JamE PhlE
CanE
IrE
NZE
HKE
IndE
JamE
PhlE 0.28571429
SgE 0.28571429 0.19047619
Next we compute Distance-Based Coherence (DBC). First fuse the distance matrices, combining all lines of evidence; then calculate pairwise correlations between the three alternations.
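analogue::fuse() combines several dissimilarity matrices into a single compromise matrix. As a rough, language-neutral sketch of the idea — scale each matrix into [0, 1] by its maximum, then take an equal-weight cell-wise average (the exact scaling and weighting inside analogue::fuse may differ) — in Python:

```python
def fuse(*mats, weights=None):
    """Sketch of distance-matrix fusion: range-scale each matrix by its
    maximum, then combine cell-wise with (by default equal) weights."""
    k = len(mats)
    w = weights or [1 / k] * k
    n = len(mats[0])
    fused = [[0.0] * n for _ in range(n)]
    for wi, m in zip(w, mats):
        mx = max(max(row) for row in m) or 1.0  # avoid dividing by zero
        for i in range(n):
            for j in range(n):
                fused[i][j] += wi * m[i][j] / mx
    return fused

d1 = [[0, 2], [2, 0]]  # toy distance matrices
d2 = [[0, 1], [1, 0]]
print(fuse(d1, d2))  # [[0.0, 1.0], [1.0, 0.0]]
```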
# genitives across lines of evidence
dfused_genitives_all <- analogue::fuse(
gen_signif_line$distance.matrix,
gen_coef_line$distance.matrix,
gen_varimp_line$distance.matrix
)
# datives across lines of evidence
dfused_datives_all <- analogue::fuse(
dat_signif_line$distance.matrix,
dat_coef_line$distance.matrix,
dat_varimp_line$distance.matrix
)
# particles across lines of evidence
dfused_particles_all <- analogue::fuse(
pv_signif_line$distance.matrix,
pv_coef_line$distance.matrix,
pv_varimp_line$distance.matrix
)
# fusing fused distance matrices across alternations
dfused_total_all <- analogue::fuse(
dfused_genitives_all,
dfused_datives_all,
dfused_particles_all
)
gen_dat_cor <- vegan::mantel(dfused_genitives_all, dfused_datives_all)
gen_pv_cor <- vegan::mantel(dfused_genitives_all, dfused_particles_all)
dat_pv_cor <- vegan::mantel(dfused_datives_all, dfused_particles_all)
Create Table 6.7.
data.frame(
Comparison = c("genitive / dative", "genitive / particle", "dative / particle"),
r = round(c(gen_dat_cor$statistic, gen_pv_cor$statistic, dat_pv_cor$statistic), 2),
pval = round(c(gen_dat_cor$signif, gen_pv_cor$signif, dat_pv_cor$signif), 2)
) |>
kable() |>
  kable_styling()
| Comparison | r | pval |
|---|---|---|
| genitive / dative | 0.18 | 0.24 |
| genitive / particle | 0.43 | 0.04 |
| dative / particle | 0.18 | 0.18 |
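The Mantel statistic r is the Pearson correlation between the lower-triangle entries of two distance matrices, and its one-sided p-value comes from permuting the rows and columns of one matrix (vegan::mantel defaults to 999 permutations). A self-contained Python sketch on toy matrices, purely to illustrate the procedure:

```python
import math
import random

def lower_triangle(m):
    return [m[i][j] for i in range(len(m)) for j in range(i)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def mantel(d1, d2, n_perm=999, seed=1):
    """Mantel test: observed correlation plus a permutation p-value,
    obtained by relabelling the objects of the second matrix."""
    r_obs = pearson(lower_triangle(d1), lower_triangle(d2))
    rng = random.Random(seed)
    idx = list(range(len(d2)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        d2p = [[d2[i][j] for j in idx] for i in idx]  # permute rows & columns
        if pearson(lower_triangle(d1), lower_triangle(d2p)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

d1 = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
d2 = [[2 * v for v in row] for row in d1]  # perfectly correlated copy
r, p = mantel(d1, d2)
print(round(r, 3))  # 1.0
```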
Next calculate by-line pairwise correlations for each alternation, as in Table 6.8.
gen_line12 <- vegan::mantel(gen_signif_line$distance.matrix, gen_coef_line$distance.matrix)
gen_line13 <- vegan::mantel(gen_signif_line$distance.matrix, gen_varimp_line$distance.matrix)
gen_line23 <- vegan::mantel(gen_coef_line$distance.matrix, gen_varimp_line$distance.matrix)
dat_line12 <- vegan::mantel(dat_signif_line$distance.matrix, dat_coef_line$distance.matrix)
dat_line13 <- vegan::mantel(dat_signif_line$distance.matrix, dat_varimp_line$distance.matrix)
dat_line23 <- vegan::mantel(dat_coef_line$distance.matrix, dat_varimp_line$distance.matrix)
pv_line12 <- vegan::mantel(pv_signif_line$distance.matrix, pv_coef_line$distance.matrix)
pv_line13 <- vegan::mantel(pv_signif_line$distance.matrix, pv_varimp_line$distance.matrix)
pv_line23 <- vegan::mantel(pv_coef_line$distance.matrix, pv_varimp_line$distance.matrix)
Create Table 6.8.
data.frame(
Comparison = c("overlap 1st / 2nd lines", "overlap 1st / 3rd lines", "overlap 2nd / 3rd lines"),
Genitive = c(
paste0("r = ", round(gen_line12$statistic, 2), " (p = ", round(gen_line12$signif, 2), ")"),
paste0("r = ", round(gen_line13$statistic, 2), " (p = ", round(gen_line13$signif, 2), ")"),
paste0("r = ", round(gen_line23$statistic, 2), " (p = ", round(gen_line23$signif, 2), ")")
),
Dative = c(
paste0("r = ", round(dat_line12$statistic, 2), " (p = ", round(dat_line12$signif, 2), ")"),
paste0("r = ", round(dat_line13$statistic, 2), " (p = ", round(dat_line13$signif, 2), ")"),
paste0("r = ", round(dat_line23$statistic, 2), " (p = ", round(dat_line23$signif, 2), ")")
),
Particle = c(
paste0("r = ", round(pv_line12$statistic, 2), " (p = ", round(pv_line12$signif, 2), ")"),
paste0("r = ", round(pv_line13$statistic, 2), " (p = ", round(pv_line13$signif, 2), ")"),
paste0("r = ", round(pv_line23$statistic, 2), " (p = ", round(pv_line23$signif, 2), ")")
)
) |>
kable() |>
  kable_styling()
| Comparison | Genitive | Dative | Particle |
|---|---|---|---|
| overlap 1st / 2nd lines | r = 0.41 (p = 0.01) | r = 0.12 (p = 0.34) | r = 0.36 (p = 0.06) |
| overlap 1st / 3rd lines | r = 0.04 (p = 0.4) | r = 0 (p = 0.48) | r = 0.35 (p = 0.04) |
| overlap 2nd / 3rd lines | r = 0.39 (p = 0.07) | r = -0.41 (p = 0.96) | r = 0.61 (p = 0) |
6.6 Mapping Out (Dis)similarity Relationships
In this section we visualize the distances between varieties using Multidimensional Scaling (MDS) plots, hierarchical clustering dendrograms, and NeighborNet diagrams (Bryant & Moulton 2004).
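Classical MDS (R's cmdscale(), used throughout this section) works by double-centering the squared distance matrix into an inner-product matrix B and taking its leading eigenvectors as coordinates. A minimal Python sketch of the double-centering step only (the eigendecomposition itself is left to cmdscale):

```python
def double_center(d):
    """B = -0.5 * J * D^2 * J, the first step of classical MDS; the top
    eigenvectors of B (scaled by root eigenvalues) are the coordinates."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand)
             for j in range(n)] for i in range(n)]

# Two points at distance 2: B has eigenvalue 2 on axis (1, -1),
# i.e. coordinates +1 and -1 on a single MDS dimension.
print(double_center([[0, 2], [2, 0]]))  # [[1.0, -1.0], [-1.0, 1.0]]
```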
Create Figure 6.2 plotting distances based on the 3rd line of evidence for the particle placement alternation.
input <- pv_varimp_line$distance.matrix # specify the distance matrix to plot
fit <- cmdscale(input, eig = TRUE, k = 2) # k is the number of dim
fit.df <- as.data.frame(fit[[1]])
names(fit.df) <- c("x", "y")
fit.df$Variety <- rownames(fit.df) |> as.factor()
ggplot(fit.df, aes(x, y)) +
geom_text_repel(aes(label = Variety),
size = 7,
box.padding = unit(0.5, "lines"), segment.colour = "grey65"
) +
geom_point(size = 5) + # change dot size here
labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
theme_cup() +
  theme(axis.title = element_text(size = 15))
Next get the MDS for each alternation.
gen_mds <- cmdscale(dfused_genitives_all, eig = TRUE, k = 2) |> # k is the number of dim
first() |>
as.data.frame() |>
rename(x = "V1", y = "V2") |>
rownames_to_column("Variety")
dat_mds <- cmdscale(dfused_datives_all, eig = TRUE, k = 2) |> # k is the number of dim
first() |>
as.data.frame() |>
rename(x = "V1", y = "V2") |>
rownames_to_column("Variety")
pv_mds <- cmdscale(dfused_particles_all, eig = TRUE, k = 2) |> # k is the number of dim
first() |>
as.data.frame() |>
rename(x = "V1", y = "V2") |>
  rownames_to_column("Variety")
ggplot(gen_mds, aes(x, y)) +
geom_text_repel(aes(label = Variety),
size = 7,
box.padding = unit(0.5, "lines"), segment.colour = "grey65"
) +
geom_point(size = 5) + # change dot size here
labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
theme_cup() +
theme(axis.title = element_text(size = 15))
ggplot(dat_mds, aes(x, y)) +
geom_text_repel(aes(label = Variety),
size = 7,
box.padding = unit(0.5, "lines"), segment.colour = "grey65"
) +
geom_point(size = 5) + # change dot size here
labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
theme_cup() +
theme(axis.title = element_text(size = 15))
ggplot(pv_mds, aes(x, y)) +
geom_text_repel(aes(label = Variety),
size = 7,
box.padding = unit(0.5, "lines"), segment.colour = "grey65"
) +
geom_point(size = 5) + # change dot size here
labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
theme_cup() +
  theme(axis.title = element_text(size = 15))
Create Figure 6.4 showing the MDS of all lines and alternations.
cmdscale(dfused_total_all, eig = TRUE, k = 2) |> # k is the number of dim
first() |>
as.data.frame() |>
rename(x = "V1", y = "V2") |>
rownames_to_column("Variety") |>
ggplot(aes(x, y)) +
geom_text_repel(aes(label = Variety),
size = 7,
box.padding = unit(0.5, "lines"), segment.colour = "grey65"
) +
geom_point(size = 5) + # change dot size here
geom_rect(aes(xmin = -.18, xmax = .4, ymin = -.2, ymax = .1),
fill = NA,
color = "black"
) +
labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
theme_cup() +
  theme(axis.title = element_text(size = 15))
Create Figure 6.5, a hierarchical cluster model of all lines and alternations.
hclust(dfused_total_all, method = "ward.D2") |>
  plot()
Create Figure 6.6, a NeighborNet model of all lines and alternations. Note that plots will vary slightly from run to run.
nnet <- phangorn::neighborNet(dfused_total_all)
# plot(nnet, "3D") # movable 3D
par("mar" = rep(1, 4))
plot(nnet, "2D")
6.7 Discussion
In the discussion we present results of a simulation study validating the VADIS method.
mod_list <- readRDS(here("model_output", "intercept_speaker_glm_list.rds"))
line1 <- VADIS::vadis_line1(mod_list, path = F)
line2 <- VADIS::vadis_line2(mod_list, path = F)
rf_mod_list <- readRDS(here("model_output", "intercept_speaker_rf_list.rds"))
line3 <- VADIS::vadis_line3(rf_mod_list, path = F)
Figure 6.7 shows an MDS of distances simulating “speakers” from 5 different “varieties”, based on the compromise distances from 3 lines of evidence.
fused_dist <- analogue::fuse(
line1$distance.matrix,
line2$distance.matrix,
line3$distance.matrix
)
fused_mds <- cmdscale(fused_dist, k = 3) %>%
as.data.frame()
names(fused_mds) <- c("x", "y", "z")
fused_mds <- fused_mds %>%
rownames_to_column("Name") |>
mutate(
# Name = rownames(rank_mds),
Variety = str_replace(Name, "\\w$", "") %>% factor(),
Speaker = str_extract(Name, "\\w$") %>% factor(),
)
levels(fused_mds$Variety) <- LETTERS[1:5]
fused_mds %>%
ggplot(aes(x, y, group = Variety)) +
geom_point(aes(shape = Variety)) +
stat_ellipse(level = .9, linetype = "dashed", type = "norm", color = "grey") +
stat_ellipse(level = .5, linewidth = 1, type = "norm", color = "grey") +
labs(x = "MDS dimension 1", y = "MDS dimension 2") +
theme_cup() +
theme(
legend.position = c(1, 0), legend.justification = c(1, 0),
legend.background = element_rect(color = "black")
  )
Figure 6.8 shows a cluster analysis dendrogram based on distances from the second line of evidence.
dd.row <- hclust(line2$distance.matrix, "average") %>%
as.dendrogram()
ddata_x <- ggdendro::dendro_data(dd.row)
labs <- label(ddata_x) %>%
mutate(
Variety = LETTERS[as.integer(str_extract(label, "\\d"))],
label = str_replace(label, "var\\d", Variety)
)
ggplot(segment(ddata_x)) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_text(
data = labs,
aes(label = label, x = x, y = y),
hjust = 1
) +
# scale_color_manual(values = cols) +
coord_flip() +
  theme_dendro()
Chapter 7
In this chapter we present the analysis of the ratings experiment.
Load the experimental ratings data.
ratings_df <- here("data", "data_exp_ratings.txt") |>
read.delim() |>
  mutate(Variety = factor(Variety, levels = c("BrE", "NZE", "SgE", "IndE")))
7.1 Methods
Create Figure 7.1 showing the distribution of corpus model probabilities for each experimental item.
ratings_df |>
select(item, pred) |>
mutate(pred = round(pred, 6)) |>
distinct() |>
mutate(prob = gtools::inv.logit(pred)) |>
ggplot(aes(reorder(item, prob), prob)) +
geom_point() +
ylim(0, 1) +
labs(x = "Item", y = "Corpus model probability") +
theme_cup() +
  theme(axis.text.x = element_text(size = 8))
Instruction page in Figure 7.2 (click to embiggen).
7.2 Results
Correlations between corpus model predictions and ratings for split variant by variety.
nested_df <- ratings_df |>
arrange(Variety) |>
group_by(Variety) |>
nest()
ylab <- data.frame(
label = "Preference rating for split variant",
x = 1,
y = 1
) |>
ggplot() +
geom_text(aes(x, y, label = label), angle = 90, family = "serif") +
theme_void() +
coord_cartesian(clip = "off")
p <- map(
seq_along(nested_df$data),
.f = function(i) {
d <- nested_df$data[[i]]
p <- d |>
ggplot(aes(pred, rating_raw)) +
geom_smooth(
method = "loess", formula = y ~ x, se = F, color = "black",
linetype = 2
) +
geom_point(alpha = .2, size = .5, color = "grey50") +
labs(
x = "",
y = ""
) +
ggtitle(pull(nested_df, Variety)[i]) +
ylim(0, 100) +
theme_cup()
if ((i %% 2) == 0) {
p <- p + theme(
# axis.title.y = element_blank(),
axis.text.y = element_blank()
)
}
p
}
) |>
wrap_plots()
p <- ylab + p + plot_layout(widths = c(.8, 25)) +
plot_annotation(
caption = "Corpus model predictions for split variant (logit scale)",
theme = theme(plot.caption = element_text(
hjust = 0.5, size = 12,
margin = margin(t = 0),
family = "serif"
))
)
p
Mean ratings for split variant by variety.
meds <- c(by(ratings_df$rating_raw, ratings_df$Variety, median))
ratings_df |>
  ggplot(aes(Variety, rating_raw)) +
  geom_hline(yintercept = 0, linewidth = 1, color = "grey") +
  geom_point(
    size = 1, alpha = .4, position = position_jitter(width = .2),
    color = "darkgray", pch = 16
  ) +
  ggplot2::stat_summary(fun = "median", size = 1.5, color = "black", pch = 18) +
  geom_label(
    data = data.frame(),
    # offset the labels above the median points without changing their values
    aes(x = names(meds), y = meds + 8, label = meds),
    fontface = "bold", size = 5, family = "serif"
  ) +
labs(x = "", y = "Predicted rating for split variant") +
ggtitle("Variety") +
  theme_cup()
Correlations between corpus model predictions and ratings for split variant by variety, averaged by stimulus item.
nested_df <- ratings_df |>
arrange(Variety) |>
group_by(Variety) |>
nest() |>
mutate(
data = map(
data,
~ .x |>
group_by(item) |>
summarise(
rating = mean(rating_raw),
pred = median(pred)
)
)
)
ylab <- data.frame(
label = "Mean preference rating for split variant",
x = 1,
y = 1
) |>
ggplot() +
geom_text(aes(x, y, label = label), angle = 90, family = "serif") +
theme_void() +
coord_cartesian(clip = "off")
p <- map(
seq_along(nested_df$data),
.f = function(i) {
d <- nested_df$data[[i]]
p <- d |>
ggplot(aes(pred, rating)) +
geom_smooth(method = "lm", formula = y ~ x, se = F, color = "gray") +
geom_point() +
ggrepel::geom_text_repel(aes(label = item),
size = 4,
segment.color = "grey", family = "serif",
max.overlaps = 15
) +
labs(
x = "",
y = ""
) +
ggtitle(pull(nested_df, Variety)[i]) +
ylim(0, 100) +
theme_cup()
if ((i %% 2) == 0) p <- p + theme(axis.text.y = element_blank())
return(p)
}
) |>
wrap_plots()
p <- ylab + p + plot_layout(widths = c(1, 25)) +
plot_annotation(
caption = "Corpus model predictions for split variant (logit scale)",
theme = theme(plot.caption = element_text(
hjust = 0.5, size = 12,
margin = margin(t = 0),
family = "serif"
))
)
p
Create Table 7.1 showing mean ratings (with standard deviation) per item and variety.
item_means <- ratings_df |>
group_by(item, Variety) |>
summarise(
mean = round(mean(rating_raw), 1),
SD = round(sd(rating_raw, na.rm = T), 1)
) |>
ungroup()
item_mean_tab <- item_means |>
dplyr::filter(Variety == "BrE") |>
rename(Br.Mean = "mean", Br.SD = "SD") |>
select(-Variety) |>
bind_cols(
item_means |>
dplyr::filter(Variety == "NZE") |>
rename(NZ.Mean = "mean", NZ.SD = "SD") |>
select(3:4),
item_means |>
dplyr::filter(Variety == "SgE") |>
rename(Sg.Mean = "mean", Sg.SD = "SD") |>
select(3:4),
item_means |>
dplyr::filter(Variety == "IndE") |>
rename(Ind.Mean = "mean", Ind.SD = "SD") |>
select(3:4)
) |>
rename(Item = "item") |>
arrange(desc(Br.Mean))
knitr::kable(item_mean_tab)
| Item | Br.Mean | Br.SD | NZ.Mean | NZ.SD | Sg.Mean | Sg.SD | Ind.Mean | Ind.SD |
|---|---|---|---|---|---|---|---|---|
| B10 | 94.7 | 14.4 | 88.7 | 23.9 | 81.2 | 24.0 | 88.4 | 23.8 |
| B8 | 94.0 | 13.2 | 92.7 | 16.9 | 84.2 | 23.5 | 86.6 | 24.5 |
| A6 | 87.9 | 18.9 | 90.0 | 22.1 | 77.6 | 29.9 | 87.6 | 22.4 |
| A10 | 87.8 | 16.1 | 80.5 | 32.7 | 62.1 | 41.4 | 55.1 | 36.2 |
| B4 | 82.5 | 21.5 | 89.8 | 15.7 | 60.7 | 37.9 | 76.4 | 33.3 |
| A9 | 82.4 | 18.7 | 89.9 | 21.6 | 72.4 | 37.4 | 69.1 | 32.7 |
| B9 | 77.3 | 15.5 | 80.0 | 24.7 | 71.9 | 34.0 | 79.6 | 25.8 |
| A15 | 76.1 | 22.0 | 79.4 | 31.0 | 82.9 | 22.3 | 77.0 | 27.6 |
| B7 | 68.1 | 23.9 | 81.2 | 30.6 | 61.6 | 39.6 | 55.5 | 40.3 |
| A8 | 67.2 | 25.2 | 66.1 | 37.5 | 56.2 | 37.7 | 71.4 | 33.3 |
| B3 | 66.5 | 27.2 | 47.9 | 37.4 | 43.2 | 35.6 | 35.4 | 34.6 |
| B15 | 65.2 | 29.1 | 64.6 | 35.4 | 56.2 | 39.5 | 48.5 | 36.0 |
| A5 | 58.5 | 35.1 | 67.4 | 39.2 | 59.2 | 42.3 | 69.6 | 28.8 |
| A12 | 57.8 | 30.7 | 58.7 | 40.2 | 61.6 | 39.6 | 55.4 | 32.7 |
| B14 | 56.9 | 35.1 | 64.5 | 39.0 | 53.2 | 39.9 | 51.7 | 39.4 |
| A13 | 56.5 | 29.5 | 39.1 | 37.9 | 44.1 | 41.4 | 43.7 | 33.6 |
| B6 | 55.3 | 30.8 | 43.5 | 35.5 | 64.3 | 33.6 | 44.2 | 32.8 |
| A4 | 54.0 | 28.3 | 58.6 | 37.0 | 57.2 | 40.2 | 60.4 | 35.4 |
| A14 | 53.2 | 29.5 | 62.3 | 38.3 | 46.8 | 39.5 | 46.4 | 37.5 |
| B13 | 48.0 | 28.9 | 60.3 | 37.7 | 45.4 | 40.6 | 42.9 | 36.1 |
| B5 | 46.2 | 29.2 | 54.5 | 39.5 | 61.2 | 36.0 | 47.4 | 32.2 |
| B1 | 42.0 | 28.1 | 31.0 | 35.9 | 34.1 | 39.0 | 32.6 | 35.8 |
| A7 | 41.6 | 34.9 | 45.8 | 43.2 | 38.2 | 33.6 | 26.9 | 26.2 |
| B12 | 39.0 | 28.9 | 57.9 | 38.3 | 67.1 | 38.6 | 49.1 | 39.0 |
| A1 | 32.5 | 27.4 | 33.9 | 38.2 | 33.2 | 34.2 | 35.6 | 36.0 |
| A3 | 24.9 | 23.8 | 19.2 | 31.4 | 12.4 | 23.7 | 11.2 | 15.5 |
| B11 | 21.4 | 21.3 | 18.1 | 29.7 | 21.5 | 29.4 | 10.2 | 16.3 |
| A2 | 20.4 | 22.7 | 16.7 | 31.2 | 10.6 | 18.5 | 10.7 | 16.1 |
| A11 | 15.1 | 21.3 | 12.3 | 22.7 | 12.3 | 21.8 | 13.3 | 17.1 |
| B2 | 7.6 | 15.1 | 5.5 | 9.9 | 8.6 | 11.8 | 5.6 | 9.6 |
Statistical Analysis
All continuous predictors were centered and scaled by 2 standard deviations following Gelman (2008). We also center the ratings around the midpoint 50.
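Dividing by two standard deviations puts continuous predictors on roughly the same footing as binary predictors, whose two values sit about 2 SDs apart when balanced (Gelman 2008). A Python sketch of the computation behind a 2-SD standardizer like the z.() helper used below (assuming centering on the mean and the sample SD):

```python
import math

def z2(x):
    """Gelman (2008) standardization: subtract the mean, then divide by
    twice the sample standard deviation."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return [(v - m) / (2 * sd) for v in x]

print([round(z, 3) for z in z2([1, 2, 3, 4, 5])])
# [-0.632, -0.316, 0.0, 0.316, 0.632]
```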
ratings_df <- ratings_df |>
mutate(
z.DirObjLettLength = z.(DirObjLettLength, factor = 2),
z.pred_min = z.(pred_new_noLen, factor = 2),
rating_c = rating_raw - 50,
gender = as.factor(gender),
education_simple = factor(education_simple, levels = c("university", "no_university", "postgraduate"))
  )
We use a custom contrast coding for Variety, which comprises the following comparisons:
- Contrast 1: Inner (BrE and NZE) vs. Outer (IndE and SgE) Circle
- Contrast 2: BrE vs. NZE (Inner only)
- Contrast 3: IndE vs. SgE (Outer only)
Use the {hypr} package to derive the appropriate codings.
custom_variety_hyp <- hypr::hypr(
InnerCvsOuterC = BrE + NZE ~ IndE + SgE,
BrEvsNZE = BrE ~ NZE,
IndEvsSgE = IndE ~ SgE,
levels = levels(ratings_df$Variety)
)
custom_variety_hyp
hypr object containing 3 null hypotheses:
H0.InnerCvsOuterC: 0 = BrE + NZE - IndE - SgE
H0.BrEvsNZE: 0 = BrE - NZE
H0.IndEvsSgE: 0 = IndE - SgE
Call:
hypr(InnerCvsOuterC = ~BrE + NZE - IndE - SgE, BrEvsNZE = ~BrE -
NZE, IndEvsSgE = ~IndE - SgE, levels = c("BrE", "NZE", "SgE",
"IndE"))
Hypothesis matrix (transposed):
InnerCvsOuterC BrEvsNZE IndEvsSgE
BrE 1 1 0
NZE 1 -1 0
SgE -1 0 -1
IndE -1 0 1
Contrast matrix:
InnerCvsOuterC BrEvsNZE IndEvsSgE
BrE 1/4 1/2 0
NZE 1/4 -1/2 0
SgE -1/4 0 -1/2
IndE -1/4 0 1/2
ratings_df$Variety_custom <- ratings_df$Variety
contrasts(ratings_df$Variety_custom) <- hypr::contr.hypothesis(custom_variety_hyp)
contrasts(ratings_df$Variety_custom)
     InnerCvsOuterC BrEvsNZE IndEvsSgE
BrE 0.25 0.5 0.0
NZE 0.25 -0.5 0.0
SgE -0.25 0.0 -0.5
IndE -0.25 0.0 0.5
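contr.hypothesis() derives the contrast matrix as the generalized inverse of the hypothesis matrix, so each fitted coefficient estimates exactly the comparison stated in its hypothesis row. We can verify the two printed matrices against each other by checking that H·C is the identity; a quick Python check with exact fractions:

```python
from fractions import Fraction as F

# Hypothesis matrix H (one row per contrast, columns = BrE, NZE, SgE, IndE),
# copied from the hypr output above.
H = [[1, 1, -1, -1],  # InnerCvsOuterC
     [1, -1, 0, 0],   # BrEvsNZE
     [0, 0, -1, 1]]   # IndEvsSgE

# Contrast matrix C (columns = the three contrasts), as assigned to Variety_custom.
C = [[F(1, 4), F(1, 2), F(0)],
     [F(1, 4), F(-1, 2), F(0)],
     [F(-1, 4), F(0), F(-1, 2)],
     [F(-1, 4), F(0), F(1, 2)]]

# H %*% C should be the 3x3 identity: each coefficient isolates its comparison
HC = [[int(sum(H[i][k] * C[k][j] for k in range(4))) for j in range(3)]
      for i in range(3)]
print(HC)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```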
We fit a linear mixed model (estimated using REML and the nloptwrap optimizer) to predict centered rating [rating_c] with variety [Variety_custom], direct object length [z.DirObjLettLength], corpus model prediction [z.pred_min], gender, and education [education_simple]. The model included second-order interaction terms for variety x direct object length and variety x corpus prediction (formula: rating_c ~ Variety_custom * (z.DirObjLettLength + z.pred_min) + gender + education_simple). The model included random intercepts for participant [id] and stimulus item, along with by-id slopes for the effect of corpus model prediction (z.pred_min) and direct object length (z.DirObjLettLength) (formula: list(~1 | id, ~1 | item, ~0 + z.pred_min | id, ~0 + z.DirObjLettLength | id)).
The final model formula:
ratings_fmla0 <- rating_c ~ (1 | id) +
(1 | item) +
(0 + z.pred_min | id) +
(0 + z.DirObjLettLength | id) +
Variety_custom +
z.DirObjLettLength +
z.pred_min +
Variety_custom:z.DirObjLettLength +
Variety_custom:z.pred_min +
gender +
  education_simple
We fit a model with the demographic predictors, then one without.
rating_mod0 <- lmer(ratings_fmla0, data = ratings_df)
# remove demographic factors
ratings_fmla1 <- update(ratings_fmla0, . ~ . - gender - education_simple)
rating_mod1 <- lmer(ratings_fmla1, data = ratings_df)
Compare the models. There is no evidence for an effect of gender or education, but we report the model with the demographic factors. (Note that anova() refits the REML models with ML, so the AICs it reports differ from the REML-based AIC given by model_performance() below.)
anova(rating_mod0, rating_mod1)
Data: ratings_df
Models:
rating_mod1: ratings_fmla1
rating_mod0: ratings_fmla0
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
rating_mod1 17 38071 38177 -19018 38037
rating_mod0 20 38076 38201 -19018 38036 0.7638 3 0.8581
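The p-value above is the upper tail of a χ² distribution with 3 degrees of freedom (the three dropped demographic coefficients) evaluated at the deviance difference. For odd df the survival function has a closed form, so we can reproduce the table's figure; a Python check:

```python
import math

def chi2_sf_df3(x):
    """P(X > x) for a chi-square variable with 3 degrees of freedom:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Chisq = 0.7638 with Df = 3, from the anova table above
print(round(chi2_sf_df3(0.7638), 4))  # 0.8581
```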
model_performance(rating_mod0)
# Indices of model performance
AIC | AICc | BIC | R2 (cond.) | R2 (marg.) | ICC | RMSE | Sigma
-------------------------------------------------------------------------------------
38018.629 | 38018.846 | 38144.004 | 0.377 | 0.238 | 0.183 | 29.725 | 30.562
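As a sanity check on these indices: AIC is 2k − 2·logLik. Plugging in the model's 20 parameters (npar in the anova table) and the REML log-likelihood of −18989.31 (reported with the model summary below) recovers the AIC above up to rounding:

```python
k = 20              # number of estimated parameters (npar)
loglik = -18989.31  # REML log-likelihood of rating_mod0
aic = 2 * k - 2 * loglik
print(round(aic, 2))  # 38018.62, matching 38018.629 up to rounding of logLik
```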
Create Table 7.2.
y <- lme4::getME(rating_mod0, "y")
smry <- summary(rating_mod0)
logLik <- round(smry$logLik, 2)
aic <- round(AIC(rating_mod0), 2)
r_sq_m <- round(MuMIn::r.squaredGLMM(rating_mod0)[1], 3)
r_sq_c <- round(MuMIn::r.squaredGLMM(rating_mod0)[2], 3)
n <- nrow(getME(rating_mod0, "X"))
fixed_eff_tab <- smry$coefficients |>
as.data.frame() |>
dplyr::select(-3)
rownames(fixed_eff_tab) <- gsub("Variety_custom", "", rownames(fixed_eff_tab))
names(fixed_eff_tab)[4] <- "p-value"
random_eff_tab <- smry$varcor |>
as.data.frame() |>
mutate(grp = str_remove(grp, "\\..*")) |>
rename("Stand.Dev." = "sdcor") |>
mutate(
name = c(
"id: by-DirObjWordLength",
"id: by-CorpusPred",
"id: (Intercept)",
"item: (Intercept)",
"Residual"
)
) |>
select(6, 5) |>
mutate(`Stand.Dev.` = round(Stand.Dev., 2))
data.frame(
`Marginal R2` = r_sq_m,
`Conditional R2` = r_sq_c,
N = n,
LogLikelihood = logLik,
AIC = aic
) |>
knitr::kable()
fixed_eff_tab |>
knitr::kable(
# format = "html",
digits = 3,
label = "fixef"
)
random_eff_tab |>
knitr::kable(
format = "html",
digits = 2,
label = "ranef"
  )
Table 7.2: Summary statistics, fixed effects coefficients, and random effects standard deviations for the linear mixed model fitting participant rating as a function of DIRECT OBJECT LENGTH, CORPUS MODEL PREDICTION, and participant VARIETY, GENDER, and EDUCATION LEVEL.
| Marginal.R2 | Conditional.R2 | N | LogLikelihood | AIC |
|---|---|---|---|---|
| 0.23 | 0.397 | 3900 | -18989.31 | 38018.63 |
|  | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 3.125 | 2.787 | 1.121 | 0.270 |
| InnerCvsOuterC | 11.552 | 2.731 | 4.230 | 0.000 |
| BrEvsNZE | -0.808 | 1.859 | -0.434 | 0.664 |
| IndEvsSgE | -1.832 | 1.921 | -0.954 | 0.341 |
| z.DirObjLettLength | -13.012 | 5.230 | -2.488 | 0.019 |
| z.pred_min | 33.271 | 5.242 | 6.347 | 0.000 |
| gendermale | 0.925 | 1.307 | 0.707 | 0.480 |
| education_simpleno_university | -0.405 | 1.739 | -0.233 | 0.816 |
| education_simplepostgraduate | -0.710 | 1.651 | -0.430 | 0.667 |
| InnerCvsOuterC:z.DirObjLettLength | -12.023 | 4.449 | -2.702 | 0.007 |
| BrEvsNZE:z.DirObjLettLength | -3.022 | 3.058 | -0.988 | 0.324 |
| IndEvsSgE:z.DirObjLettLength | -7.334 | 3.207 | -2.287 | 0.023 |
| InnerCvsOuterC:z.pred_min | 2.848 | 4.206 | 0.677 | 0.499 |
| BrEvsNZE:z.pred_min | -3.954 | 2.847 | -1.389 | 0.166 |
| IndEvsSgE:z.pred_min | 0.308 | 3.071 | 0.100 | 0.920 |
| name | Stand.Dev. |
|---|---|
| id: by-DirObjWordLength | 7.39 |
| id: by-CorpusPred | 5.00 |
| id: (Intercept) | 6.52 |
| item: (Intercept) | 14.00 |
| Residual | 30.56 |
Plot the partial effect of Variety on ratings.
effects::Effect(c("Variety_custom"), rating_mod0) |>
as.data.frame() |>
mutate(Variety_custom = factor(Variety_custom,
levels = c("BrE", "NZE", "SgE", "IndE")
)) |>
ggplot(aes(Variety_custom, fit)) +
geom_hline(yintercept = 0, size = 1, color = "grey") +
geom_point(size = 4) +
geom_errorbar(aes(ymin = lower, ymax = upper), width = .1) +
# ylim(-10, 10) +
labs(x = "", y = "Predicted rating (centered at 50)") +
ggtitle("Variety") +
theme_cup() +
theme(
axis.title.y = element_text(size = 20),
axis.text.x = element_text(size = 20),
axis.text.y = element_text(size = 20),
plot.title = element_text(size = 28)
  )
Plot the partial effects for the Variety X Corpus prediction (Variety_custom:z.pred_min) interaction.
cols <- RColorBrewer::brewer.pal(4, "Set1")
eff_df <- effects::Effect(
focal.predictors = c("z.pred_min", "Variety_custom"),
mod = rating_mod1,
xlevels = 10
) |>
as.data.frame()
nested_df <- eff_df |>
arrange(Variety_custom) |>
group_by(Variety_custom) |>
nest() |>
ungroup()
text_df <- eff_df |>
group_by(Variety_custom) |>
summarise(
x = max(z.pred_min) + .05,
y = max(fit)
) |>
distinct() |>
ungroup()
# single plot with overlapping lines
p1 <- eff_df |>
ggplot(aes(z.pred_min, fit, color = Variety_custom)) +
geom_line(aes(linetype = Variety_custom), size = 1) +
labs(x = "", y = "") +
geom_text(
data = text_df,
aes(x, y, label = Variety_custom, color = Variety_custom),
family = "serif", size = 5, hjust = c(0, 0, 1, 0)
) +
theme_cup() +
scale_color_grey(guide = "none") +
scale_linetype_discrete(guide = "none") +
# xlim(-2.1, 3.8) +
theme(legend.position = "top", legend.text = element_text(size = rel(1.2)))
# faceted plots with confidence intervals
p2 <- map(
seq_along(nested_df$data),
.f = function(i) {
d <- nested_df$data[[i]]
p <- d |>
ggplot(aes(z.pred_min, fit)) +
geom_ribbon(aes(ymin = lower, ymax = upper),
alpha = .3, color = "grey"
) +
geom_line(size = 1) +
labs(
x = "",
y = ""
) +
ggtitle(pull(nested_df, Variety_custom)[i]) +
ylim(-28, 30) +
theme_cup()
return(p)
}
) |>
wrap_plots(ncol = 2)
# plot for label
ylab <- data.frame(
label = "Predicted rating (centered at 50)",
x = 1,
y = 1
) |>
ggplot() +
geom_text(aes(x, y, label = label), angle = 90, family = "serif") +
theme_void() +
coord_cartesian(clip = "off")
p <- ylab + (p1 / p2 + plot_layout(heights = c(20, 30))) +
plot_layout(widths = c(.8, 25)) +
plot_annotation(
caption = "Corpus model predicted log odds",
theme = theme(plot.caption = element_text(
hjust = 0.5, size = 12,
margin = margin(t = 0),
family = "serif"
))
)
p
Plot the partial effects for the Variety X Direct Object Length (Variety_custom:z.DirObjLettLength) interaction.
cols <- RColorBrewer::brewer.pal(4, "Set1")
length_levs <- ratings_df |>
pull(z.DirObjLettLength) |>
round(3) |>
unique() |>
sort()
eff_df <- effects::Effect(
focal.predictors = c("z.DirObjLettLength", "Variety_custom"),
mod = rating_mod1,
xlevels = list(z.DirObjLettLength = length_levs)
) |>
as.data.frame() |>
mutate(Variety_custom = factor(Variety_custom,
levels = c("BrE", "NZE", "SgE", "IndE")
))
nested_df <- eff_df |>
arrange(Variety_custom) |>
group_by(Variety_custom) |>
nest() |>
ungroup()
text_df <- eff_df |>
group_by(Variety_custom) |>
summarise(
x = ifelse(Variety_custom == "SgE",
min(z.DirObjLettLength) - .05, max(z.DirObjLettLength) + .05
),
y = ifelse(Variety_custom == "SgE", max(fit), min(fit))
) |>
distinct() |>
ungroup()
# single plot with overlapping lines
p1 <- eff_df |>
ggplot(aes(z.DirObjLettLength, fit, color = Variety_custom)) +
geom_line(aes(linetype = Variety_custom), size = 1) +
labs(x = "", y = "") +
geom_text(
data = text_df,
aes(x, y, label = Variety_custom, color = Variety_custom),
family = "serif", size = 5, hjust = c(0, 0, 1, 0)
) +
theme_cup() +
scale_color_grey(guide = "none") +
scale_linetype_discrete(guide = "none") +
# xlim(-2.1, 3.8) +
theme(legend.position = "top", legend.text = element_text(size = rel(1.2)))
# faceted plots with confidence intervals
p2 <- map(
seq_along(nested_df$data),
.f = function(i) {
d <- nested_df$data[[i]]
p <- d |>
ggplot(aes(z.DirObjLettLength, fit)) +
geom_ribbon(aes(ymin = lower, ymax = upper),
alpha = .3, color = "grey"
) +
geom_line(size = 1) +
labs(
x = "",
y = ""
) +
ggtitle(pull(nested_df, Variety_custom)[i]) +
ylim(-28, 30) +
theme_cup()
return(p)
}
) |>
wrap_plots(ncol = 2)
# plot for label
ylab <- data.frame(
label = "Predicted rating (centered at 50)",
x = 1,
y = 1
) |>
ggplot() +
geom_text(aes(x, y, label = label), angle = 90, family = "serif") +
theme_void() +
coord_cartesian(clip = "off")
p <- ylab + (p1 / p2 + plot_layout(heights = c(20, 30))) +
plot_layout(widths = c(.8, 25)) +
plot_annotation(
caption = "Number of words in direct object\n(z-score standardized)",
theme = theme(plot.caption = element_text(
hjust = 0.5, size = 12,
margin = margin(t = 0),
family = "serif"
))
)
p
7.3 Discussion
Create the tables showing the amount of agreement between participants and corpus model.
ratings_df <- ratings_df |>
mutate(
Observed = str_replace_all(Response, "Discontinuous", "Split"),
corpus_pred = ifelse(pred_new > 0, "Split", "Continuous"),
participant_pref = ifelse(rating_c > 0, "Split", "Continuous"),
agree = ifelse(corpus_pred == participant_pref, 1, 0),
Corpus_model_correct = ifelse(corpus_pred == Observed, "Yes", "No")
)
ratings_df |>
group_by(Variety) |>
summarise(
Percent_agree = round(100 * mean(agree), 1)
  )
# A tibble: 4 × 2
Variety Percent_agree
<fct> <dbl>
1 BrE 69.2
2 NZE 67.4
3 SgE 65.4
4 IndE 67.2
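Percent_agree is just the share of trials on which the corpus model's preferred variant matches the participant's; a toy Python sketch of the computation (invented data):

```python
def percent_agree(corpus_pred, participant_pref):
    """Share of trials where the corpus model's preferred variant
    matches the participant's preferred variant."""
    matches = sum(a == b for a, b in zip(corpus_pred, participant_pref))
    return round(100 * matches / len(corpus_pred), 1)

corpus = ["Split", "Split", "Continuous", "Split"]       # toy predictions
people = ["Split", "Continuous", "Continuous", "Split"]  # toy preferences
print(percent_agree(corpus, people))  # 75.0
```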
ratings_df |>
group_by(item) |>
summarise(
Mean_rating = round(mean(rating_raw), 1),
SD_rating = round(sd(rating_raw, na.rm = T), 1),
Percent_agree = mean(agree)
) |>
ungroup() |>
arrange(desc(Percent_agree)) |>
inner_join(ratings_df |> select(item, pred, Observed, Corpus_model_correct), by = "item") |>
distinct(item, .keep_all = T) |>
select(item, Mean_rating, pred, Percent_agree, Observed, Corpus_model_correct) |>
  rename("Corpus_model_prob" = "pred", Item = "item")
# A tibble: 30 × 6
Item Mean_rating Corpus_model_prob Percent_agree Observed
<chr> <dbl> <dbl> <dbl> <chr>
1 B2 6.9 -3.39 0.992 Continuous
2 A11 13.2 -7.30 0.940 Continuous
3 B8 89.1 2.18 0.921 Split
4 A2 15.1 -3.26 0.918 Continuous
5 B11 18.1 -6.83 0.897 Continuous
6 A3 17.6 -1.34 0.896 Continuous
7 B10 87.8 0.799 0.889 Split
8 A6 86.5 1.19 0.888 Split
9 A15 78.8 1.21 0.843 Split
10 A9 80.4 1.56 0.843 Split
# ℹ 20 more rows
# ℹ 1 more variable: Corpus_model_correct <chr>
Show Figure 7.9, the random forest partial dependence plots adapted from Szmrecsanyi et al. (2016).
Appendix
6.4 Quantification of Similarity Coefficients for data subsets
Spoken data only
Just the spoken data.
data_gens_spoken <- data_genitives |>
filter(MODE == "spoken") |>
mutate(
PorHeadLemma_pruned = filter_infrequent(POR_HEAD_LEMMA, 20),
PumHeadLemma_pruned = filter_infrequent(PUM_HEAD_LEMMA, 20)
) |>
droplevels()
data_gens_spoken_split <- split(data_gens_spoken, data_gens_spoken$variety_of_E)
data_dats_spoken <- data_datives |>
filter(Mode == "spoken") |>
mutate(
Verb_pruned = filter_infrequent(Verb, 20),
RecHeadPlain_pruned = filter_infrequent(RecHeadPlain, 20),
ThemeHeadPlain_pruned = filter_infrequent(ThemeHeadPlain, 20)
) |>
droplevels()
data_dats_spoken_split <- split(data_dats_spoken, data_dats_spoken$variety_of_E)
data_pv_spoken <- data_particle_verbs |>
filter(grepl("^spok", Register)) |>
mutate(
VerbPart_pruned = filter_infrequent(VerbPart, 20),
Verb_pruned = filter_infrequent(Verb, 20)
) |>
droplevels()
data_pv_spoken_split <- split(data_pv_spoken, data_pv_spoken$variety_of_E)
# remove unnecessary
rm(data_gens_spoken, data_dats_spoken, data_pv_spoken)
invisible(gc(verbose = FALSE)) # free unused memory
Start with the GENITIVE alternation. First run the by-variety regression models.
if (!file.exists(here("model_output", "genitives_variety_glmer_spk.rds"))) {
genitives_variety_glmer_spk <- vector("list")
for (i in seq_along(data_gens_spoken_split)) {
d <- data_gens_spoken_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = gen_f4) # use the fitting function for convenience
# fit the model
genitives_variety_glmer_spk[[i]] <- glmer(
gen_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(genitives_variety_glmer_spk) <- names(data_gens_spoken_split)
saveRDS(genitives_variety_glmer_spk, here("model_output", "genitives_variety_glmer_spk.rds"))
} else {
genitives_variety_glmer_spk <- readRDS(here("model_output", "genitives_variety_glmer_spk.rds"))
}
Check models.
round(VADIS::summary_stats(genitives_variety_glmer_spk), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 478 0.743 0.822 0.116 0.880 0.365 388.948 1.222 1.970
CanE 299 0.609 0.863 0.109 0.916 0.354 235.599 1.294 2.384
IrE 355 0.696 0.842 0.115 0.896 0.363 281.695 1.467 2.043
NZE 381 0.619 0.832 0.126 0.883 0.416 363.869 1.094 2.129
HKE 505 0.703 0.820 0.125 0.879 0.386 426.114 1.344 2.132
IndE 453 0.859 0.914 0.065 0.920 0.223 241.338 1.259 1.701
JamE 412 0.813 0.859 0.096 0.867 0.316 284.277 1.355 2.047
PhlE 545 0.844 0.890 0.079 0.886 0.266 334.841 1.172 1.758
SgE 416 0.680 0.844 0.108 0.915 0.343 328.757 1.247 2.305
HosLem.p
BrE 0.190
CanE 0.217
IrE 0.589
NZE 0.034
HKE 0.754
IndE 0.591
JamE 0.182
PhlE 0.687
SgE 0.633
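The table above reports, among other diagnostics, the Brier score and the C index of concordance. Both can be computed directly from a model's fitted probabilities; a minimal sketch on toy values (not drawn from the data), using `somers2()` from {Hmisc}, which is loaded at the top of this document:

```r
# toy values: observed binary outcomes and fitted probabilities
obs  <- c(1, 0, 1, 1, 0, 0)
phat <- c(0.9, 0.2, 0.7, 0.6, 0.4, 0.1)

# Brier score: mean squared distance between probability and outcome
# (lower is better)
mean((phat - obs)^2)

# C index (concordance, equivalent to the area under the ROC curve)
Hmisc::somers2(phat, obs)["C"]
```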
Now fit random forest for each variety.
if (file.exists(here("model_output", "gen_crf_vadis_list_spk.rds"))) {
genitives_variety_forests_vadis_spk <- readRDS(
here("model_output", "gen_crf_vadis_list_spk.rds")
)
} else {
genitives_variety_forests_vadis_spk <- lapply(
data_gens_spoken_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(genitives_variety_forests_vadis_spk, here("model_output", "gen_crf_vadis_list_spk.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
gen_signif_line <- vadis_line1(genitives_variety_glmer_spk,
path = here("model_output", "gens_line1_spk.rds"),
overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer_spk,
path = here("model_output", "gens_line2_spk.rds"),
overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis_spk,
path = here("model_output", "gens_line3_spk.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
gen_mean_sims <- data.frame(
line1 = gen_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = gen_coef_line$similarity.scores[, 1],
line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_gens_spoken_split), "Line Mean")
round(gen_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.722 0.433 0.574 0.576
CanE 0.694 0.477 0.557 0.576
IrE 0.736 0.479 0.625 0.613
NZE 0.667 0.425 0.560 0.550
HKE 0.806 0.183 0.211 0.400
IndE 0.625 0.298 0.461 0.462
JamE 0.736 0.444 0.661 0.614
PhlE 0.736 0.403 0.438 0.525
SgE 0.722 0.493 0.586 0.601
Line Mean 0.716 0.404 0.519 0.546
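The `add_row(!!!colMeans(.))` idiom used above splices the named vector of column means into `add_row()` as name-value pairs, appending a row of means. A small self-contained illustration on toy numbers:

```r
library(magrittr) # %>% pipe (loaded via the tidyverse above)
library(tibble)   # add_row()

toy <- data.frame(line1 = c(0.8, 0.6), line2 = c(0.4, 0.6))
# colMeans(.) returns a named vector; !!! splices it into add_row()
# as name-value pairs, so a row of column means is appended
toy_with_mean <- toy %>% add_row(!!!colMeans(.))
toy_with_mean$Variety_Mean <- rowMeans(toy_with_mean)
toy_with_mean
```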
Next the DATIVE alternation.
if (!file.exists(here("model_output", "datives_variety_glmer_spk.rds"))) {
datives_variety_glmer_spk <- vector("list")
for (i in seq_along(data_dats_spoken_split)) {
d <- data_dats_spoken_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = dat_f4) # use the fitting function for convenience
# fit the model
datives_variety_glmer_spk[[i]] <- glmer(
dat_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(datives_variety_glmer_spk) <- names(data_dats_spoken_split)
saveRDS(datives_variety_glmer_spk, here("model_output", "datives_variety_glmer_spk.rds"))
} else {
datives_variety_glmer_spk <- readRDS(here("model_output", "datives_variety_glmer_spk.rds"))
}
Check models.
round(VADIS::summary_stats(datives_variety_glmer_spk), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 523 0.744 0.922 0.058 0.957 0.212 279.934 2.512 4.442
CanE 570 0.767 0.911 0.059 0.969 0.189 271.397 2.517 4.022
IrE 521 0.770 0.914 0.061 0.967 0.199 261.081 2.476 4.121
NZE 580 0.734 0.890 0.080 0.947 0.257 354.169 3.004 4.238
HKE 710 0.699 0.911 0.062 0.968 0.204 366.309 2.303 4.218
IndE 643 0.577 0.925 0.061 0.967 0.217 340.888 2.231 4.595
JamE 615 0.792 0.967 0.028 0.987 0.103 188.365 2.485 4.408
PhlE 622 0.711 0.916 0.064 0.961 0.218 328.455 2.949 4.568
SgE 690 0.757 0.919 0.065 0.956 0.223 371.941 2.773 4.228
HosLem.p
BrE 0.329
CanE 0.909
IrE 0.947
NZE 0.496
HKE 0.969
IndE 0.521
JamE 0.000
PhlE 0.790
SgE 0.688
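The dative models above were fitted with the same fit-or-load caching pattern used for every model list in this appendix. That pattern can be factored into a small helper; a sketch under the assumption that the fitting code is wrapped in a function (`fit_or_load()` and `fit_one_variety` are hypothetical, not part of the original code):

```r
# Hypothetical helper factoring out the fit-or-load caching pattern:
# reload a cached .rds if it exists, otherwise run the fitting code
# and cache the result before returning it.
fit_or_load <- function(path, fit_fun) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    obj <- fit_fun()
    saveRDS(obj, path)
    obj
  }
}

# usage sketch, mirroring the glmer chunks:
# datives_variety_glmer_spk <- fit_or_load(
#   here("model_output", "datives_variety_glmer_spk.rds"),
#   function() lapply(data_dats_spoken_split, fit_one_variety)
# )
```

Keeping the check, save, and reload paths in one place avoids the filename mismatches that silently defeat caching.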
Now fit random forest for each variety.
if (file.exists(here("model_output", "dat_crf_vadis_list_spk.rds"))) {
datives_variety_forests_vadis_spk <- readRDS(
here("model_output", "dat_crf_vadis_list_spk.rds")
)
} else {
datives_variety_forests_vadis_spk <- lapply(
data_dats_spoken_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(datives_variety_forests_vadis_spk, here("model_output", "dat_crf_vadis_list_spk.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
dat_signif_line <- vadis_line1(datives_variety_glmer_spk,
path = here("model_output", "dats_line1_spk.rds"),
overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer_spk,
path = here("model_output", "dats_line2_spk.rds"),
overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis_spk,
path = here("model_output", "dats_line3_spk.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
dat_mean_sims <- data.frame(
line1 = dat_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = dat_coef_line$similarity.scores[, 1],
line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_dats_spoken_split), "Line Mean")
round(dat_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.797 0.603 0.506 0.635
CanE 0.703 0.511 0.714 0.643
IrE 0.797 0.565 0.649 0.670
NZE 0.641 0.573 0.640 0.618
HKE 0.781 0.565 0.545 0.630
IndE 0.672 0.509 0.628 0.603
JamE 0.781 -0.058 0.637 0.453
PhlE 0.594 0.570 0.646 0.603
SgE 0.734 0.556 0.536 0.609
Line Mean 0.722 0.488 0.611 0.607
Finally, the PARTICLE PLACEMENT alternation.
if (!file.exists(here("model_output", "pv_variety_glmer_spk.rds"))) {
pv_variety_glmer_spk <- vector("list")
for (i in seq_along(data_pv_spoken_split)) {
d <- data_pv_spoken_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = pv_f4) # use the fitting function for convenience
# fit the model
pv_variety_glmer_spk[[i]] <- glmer(
pv_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(pv_variety_glmer_spk) <- names(data_pv_spoken_split)
saveRDS(pv_variety_glmer_spk, here("model_output", "pv_variety_glmer_spk.rds"))
} else {
pv_variety_glmer_spk <- readRDS(here("model_output", "pv_variety_glmer_spk.rds"))
}
Check models.
round(VADIS::summary_stats(pv_variety_glmer_spk), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 573 0.557 0.787 0.146 0.870 0.446 595.582 1.196 1.799
CanE 669 0.556 0.806 0.135 0.887 0.421 646.380 1.160 1.897
IrE 674 0.614 0.773 0.156 0.843 0.472 724.131 1.226 1.838
NZE 713 0.560 0.827 0.122 0.906 0.384 659.498 1.188 1.744
HKE 493 0.811 0.870 0.098 0.886 0.314 362.358 1.216 1.853
IndE 409 0.866 0.914 0.062 0.928 0.211 229.046 1.352 2.158
JamE 478 0.808 0.872 0.092 0.897 0.299 363.115 1.312 1.886
PhlE 366 0.839 0.902 0.072 0.922 0.238 224.632 1.370 1.885
SgE 527 0.793 0.867 0.093 0.907 0.295 368.461 1.270 1.874
HosLem.p
BrE 0.811
CanE 0.489
IrE 0.764
NZE 0.104
HKE 0.761
IndE 0.914
JamE 0.401
PhlE 0.940
SgE 0.797
Now fit random forest for each variety.
if (file.exists(here("model_output", "pv_crf_vadis_list_spk.rds"))) {
pv_variety_forests_vadis_spk <- readRDS(
here("model_output", "pv_crf_vadis_list_spk.rds")
)
} else {
pv_variety_forests_vadis_spk <- lapply(
data_pv_spoken_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(pv_variety_forests_vadis_spk, here("model_output", "pv_crf_vadis_list_spk.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
pv_signif_line <- vadis_line1(pv_variety_glmer_spk,
path = here("model_output", "pv_line1_spk.rds"),
overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer_spk,
path = here("model_output", "pv_line2_spk.rds"),
overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis_spk,
path = here("model_output", "pv_line3_spk.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
pv_mean_sims <- data.frame(
line1 = pv_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = pv_coef_line$similarity.scores[, 1],
line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_pv_spoken_split), "Line Mean")
round(pv_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.734 0.726 0.765 0.742
CanE 0.734 0.688 0.795 0.739
IrE 0.719 0.716 0.798 0.744
NZE 0.672 0.705 0.771 0.716
HKE 0.703 0.731 0.804 0.746
IndE 0.531 0.353 0.503 0.462
JamE 0.766 0.687 0.643 0.699
PhlE 0.719 0.645 0.750 0.705
SgE 0.734 0.678 0.708 0.707
Line Mean 0.701 0.659 0.726 0.695
Now create the summary table.
tab <- data.frame(
Genitives = unlist(gen_mean_sims[10, ]),
Datives = unlist(dat_mean_sims[10, ]),
`Particle placement` = unlist(pv_mean_sims[10, ])
) |>
round(2)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
tab
Genitives Datives Particle.placement
1st line 0.72 0.72 0.70
2nd line 0.40 0.49 0.66
3rd line 0.52 0.61 0.73
mean 0.55 0.61 0.70
Core Grammar Score for spoken data only:
round(rowMeans(tab[4, ]), 3)
mean
0.62
# clean up
rm(genitives_variety_forests_vadis_spk, datives_variety_forests_vadis_spk, pv_variety_forests_vadis_spk)
invisible(gc(verbose = FALSE)) # free unused memory
Written data only
Just the written data.
data_gens_written <- data_genitives |>
filter(MODE == "written") |>
mutate(
PorHeadLemma_pruned = filter_infrequent(POR_HEAD_LEMMA, 20),
PumHeadLemma_pruned = filter_infrequent(PUM_HEAD_LEMMA, 20)
) |>
droplevels()
data_gens_written_split <- split(data_gens_written, data_gens_written$variety_of_E)
data_dats_written <- data_datives |>
filter(Mode == "written") |>
mutate(
Verb_pruned = filter_infrequent(Verb, 20),
RecHeadPlain_pruned = filter_infrequent(RecHeadPlain, 20),
ThemeHeadPlain_pruned = filter_infrequent(ThemeHeadPlain, 20)
) |>
droplevels()
data_dats_written_split <- split(data_dats_written, data_dats_written$variety_of_E)
data_pv_written <- data_particle_verbs |>
filter(grepl("^writ", Register)) |>
mutate(
VerbPart_pruned = filter_infrequent(VerbPart, 20),
Verb_pruned = filter_infrequent(Verb, 20)
) |>
droplevels()
data_pv_written_split <- split(data_pv_written, data_pv_written$variety_of_E)
# remove unnecessary
rm(data_gens_written, data_dats_written, data_pv_written)
invisible(gc(verbose = FALSE)) # free unused memory
Start with the GENITIVE alternation. First run the by-variety regression models.
if (!file.exists(here("model_output", "genitives_variety_glmer_wrt.rds"))) {
genitives_variety_glmer_wrt <- vector("list")
for (i in seq_along(data_gens_written_split)) {
d <- data_gens_written_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = gen_f4) # use the fitting function for convenience
# fit the model
genitives_variety_glmer_wrt[[i]] <- glmer(
gen_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(genitives_variety_glmer_wrt) <- names(data_gens_written_split)
saveRDS(genitives_variety_glmer_wrt, here("model_output", "genitives_variety_glmer_wrt.rds"))
} else {
genitives_variety_glmer_wrt <- readRDS(here("model_output", "genitives_variety_glmer_wrt.rds"))
}
Check models.
round(VADIS::summary_stats(genitives_variety_glmer_wrt), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 1134 0.723 0.863 0.099 0.915 0.321 798.379 1.292 1.979
CanE 1061 0.618 0.846 0.112 0.916 0.352 806.000 1.281 2.225
IrE 958 0.706 0.849 0.113 0.898 0.355 735.456 1.149 1.981
NZE 1486 0.713 0.861 0.103 0.906 0.332 1076.406 1.166 1.917
HKE 1024 0.668 0.857 0.096 0.935 0.305 708.852 1.125 2.159
IndE 1383 0.717 0.863 0.098 0.918 0.317 966.937 1.139 2.000
JamE 895 0.753 0.884 0.086 0.929 0.280 579.985 1.283 1.979
PhlE 1204 0.663 0.826 0.120 0.899 0.375 968.461 1.100 2.022
SgE 809 0.644 0.852 0.112 0.912 0.359 666.935 1.355 2.420
HosLem.p
BrE 0.021
CanE 0.619
IrE 0.569
NZE 0.002
HKE 0.422
IndE 0.628
JamE 0.759
PhlE 0.679
SgE 0.268
Now fit random forest for each variety.
if (file.exists(here("model_output", "gen_crf_vadis_list_wrt.rds"))) {
genitives_variety_forests_vadis_wrt <- readRDS(
here("model_output", "gen_crf_vadis_list_wrt.rds")
)
} else {
genitives_variety_forests_vadis_wrt <- lapply(
data_gens_written_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(genitives_variety_forests_vadis_wrt, here("model_output", "gen_crf_vadis_list_wrt.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
gen_signif_line <- vadis_line1(genitives_variety_glmer_wrt,
path = here("model_output", "gens_line1_wrt.rds"),
overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer_wrt,
path = here("model_output", "gens_line2_wrt.rds"),
overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis_wrt,
path = here("model_output", "gens_line3_wrt.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
gen_mean_sims <- data.frame(
line1 = gen_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = gen_coef_line$similarity.scores[, 1],
line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_gens_written_split), "Line Mean")
round(gen_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.903 0.627 0.836 0.789
CanE 0.833 0.687 0.872 0.797
IrE 0.833 0.717 0.827 0.793
NZE 0.861 0.694 0.762 0.772
HKE 0.833 0.653 0.818 0.768
IndE 0.792 0.630 0.756 0.726
JamE 0.903 0.561 0.804 0.756
PhlE 0.861 0.540 0.789 0.730
SgE 0.903 0.660 0.857 0.806
Line Mean 0.858 0.641 0.813 0.771
Next the DATIVE alternation.
if (!file.exists(here("model_output", "datives_variety_glmer_wrt.rds"))) {
datives_variety_glmer_wrt <- vector("list")
for (i in seq_along(data_dats_written_split)) {
d <- data_dats_written_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = dat_f4) # use the fitting function for convenience
# fit the model
datives_variety_glmer_wrt[[i]] <- glmer(
dat_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(datives_variety_glmer_wrt) <- names(data_dats_written_split)
saveRDS(datives_variety_glmer_wrt, here("model_output", "datives_variety_glmer_wrt.rds"))
} else {
datives_variety_glmer_wrt <- readRDS(here("model_output", "datives_variety_glmer_wrt.rds"))
}
Check models.
round(VADIS::summary_stats(datives_variety_glmer_wrt), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 795 0.683 0.870 0.085 0.946 0.274 521.633 2.346 3.882
CanE 787 0.675 0.892 0.077 0.958 0.246 476.893 1.764 3.645
IrE 749 0.698 0.877 0.085 0.942 0.280 504.271 1.974 4.051
NZE 905 0.696 0.884 0.083 0.943 0.273 585.645 1.771 3.581
HKE 1058 0.598 0.868 0.091 0.943 0.297 731.952 1.927 3.787
IndE 907 0.596 0.885 0.085 0.950 0.279 611.356 2.002 4.092
JamE 760 0.645 0.893 0.084 0.951 0.274 492.324 2.330 3.948
PhlE 883 0.626 0.905 0.071 0.965 0.231 491.132 1.905 3.931
SgE 853 0.671 0.869 0.090 0.945 0.282 554.641 2.278 3.897
HosLem.p
BrE 0.414
CanE 0.754
IrE 0.627
NZE 0.454
HKE 0.014
IndE 0.042
JamE 0.921
PhlE 0.889
SgE 0.173
Now fit random forest for each variety.
if (file.exists(here("model_output", "dat_crf_vadis_list_wrt.rds"))) {
datives_variety_forests_vadis_wrt <- readRDS(
here("model_output", "dat_crf_vadis_list_wrt.rds")
)
} else {
datives_variety_forests_vadis_wrt <- lapply(
data_dats_written_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(datives_variety_forests_vadis_wrt, here("model_output", "dat_crf_vadis_list_wrt.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
dat_signif_line <- vadis_line1(datives_variety_glmer_wrt,
path = here("model_output", "dats_line1_wrt.rds"),
overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer_wrt,
path = here("model_output", "dats_line2_wrt.rds"),
overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis_wrt,
path = here("model_output", "dats_line3_wrt.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
dat_mean_sims <- data.frame(
line1 = dat_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = dat_coef_line$similarity.scores[, 1],
line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_dats_written_split), "Line Mean")
round(dat_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.781 0.647 0.938 0.789
CanE 0.766 0.730 0.917 0.804
IrE 0.766 0.698 0.938 0.800
NZE 0.641 0.729 0.842 0.737
HKE 0.688 0.736 0.905 0.776
IndE 0.719 0.736 0.920 0.792
JamE 0.703 0.648 0.890 0.747
PhlE 0.531 0.571 0.815 0.639
SgE 0.781 0.738 0.920 0.813
Line Mean 0.708 0.693 0.898 0.766
Finally, the PARTICLE PLACEMENT alternation.
if (!file.exists(here("model_output", "pv_variety_glmer_wrt.rds"))) {
pv_variety_glmer_wrt <- vector("list")
for (i in seq_along(data_pv_written_split)) {
d <- data_pv_written_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = pv_f4) # use the fitting function for convenience
# fit the model
pv_variety_glmer_wrt[[i]] <- glmer(
pv_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(pv_variety_glmer_wrt) <- names(data_pv_written_split)
saveRDS(pv_variety_glmer_wrt, here("model_output", "pv_variety_glmer_wrt.rds"))
} else {
pv_variety_glmer_wrt <- readRDS(here("model_output", "pv_variety_glmer_wrt.rds"))
}
Check models.
round(VADIS::summary_stats(pv_variety_glmer_wrt), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 573 0.557 0.787 0.146 0.870 0.446 595.582 1.196 1.799
CanE 669 0.556 0.806 0.135 0.887 0.421 646.380 1.160 1.897
IrE 674 0.614 0.773 0.156 0.843 0.472 724.131 1.226 1.838
NZE 713 0.560 0.827 0.122 0.906 0.384 659.498 1.188 1.744
HKE 493 0.811 0.870 0.098 0.886 0.314 362.358 1.216 1.853
IndE 409 0.866 0.914 0.062 0.928 0.211 229.046 1.352 2.158
JamE 478 0.808 0.872 0.092 0.897 0.299 363.115 1.312 1.886
PhlE 366 0.839 0.902 0.072 0.922 0.238 224.632 1.370 1.885
SgE 527 0.793 0.867 0.093 0.907 0.295 368.461 1.270 1.874
HosLem.p
BrE 0.811
CanE 0.489
IrE 0.764
NZE 0.104
HKE 0.761
IndE 0.914
JamE 0.401
PhlE 0.940
SgE 0.797
Now fit random forest for each variety.
if (file.exists(here("model_output", "pv_crf_vadis_list_wrt.rds"))) {
pv_variety_forests_vadis_wrt <- readRDS(
here("model_output", "pv_crf_vadis_list_wrt.rds")
)
} else {
pv_variety_forests_vadis_wrt <- lapply(
data_pv_written_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(pv_variety_forests_vadis_wrt, here("model_output", "pv_crf_vadis_list_wrt.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
pv_signif_line <- vadis_line1(pv_variety_glmer_wrt,
path = here("model_output", "pv_line1_wrt.rds"),
overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer_wrt,
path = here("model_output", "pv_line2_wrt.rds"),
overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis_wrt,
path = here("model_output", "pv_line3_wrt.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
pv_mean_sims <- data.frame(
line1 = pv_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = pv_coef_line$similarity.scores[, 1],
line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_pv_written_split), "Line Mean")
round(pv_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.734 0.726 0.762 0.741
CanE 0.734 0.688 0.771 0.731
IrE 0.719 0.716 0.789 0.741
NZE 0.672 0.705 0.777 0.718
HKE 0.703 0.731 0.762 0.732
IndE 0.531 0.353 0.539 0.474
JamE 0.766 0.687 0.616 0.690
PhlE 0.719 0.645 0.732 0.699
SgE 0.734 0.678 0.711 0.708
Line Mean 0.701 0.659 0.718 0.693
Create summary table.
tab <- data.frame(
Genitives = unlist(gen_mean_sims[10, ]),
Datives = unlist(dat_mean_sims[10, ]),
`Particle placement` = unlist(pv_mean_sims[10, ])
) |>
round(2)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
tab
Genitives Datives Particle.placement
1st line 0.86 0.71 0.70
2nd line 0.64 0.69 0.66
3rd line 0.81 0.90 0.72
mean 0.77 0.77 0.69
Core Grammar Score for written data only:
round(rowMeans(tab[4, ]), 3)
mean
0.743
# clean up
rm(genitives_variety_forests_vadis_wrt, datives_variety_forests_vadis_wrt, pv_variety_forests_vadis_wrt)
invisible(gc(verbose = FALSE)) # free unused memory
Inner Circle only
Just the Inner Circle data.
data_gens_inner <- data_genitives |>
filter(Variety %in% c("gb", "ire", "nz", "can")) |>
mutate(
PorHeadLemma_pruned = filter_infrequent(POR_HEAD_LEMMA, 20),
PumHeadLemma_pruned = filter_infrequent(PUM_HEAD_LEMMA, 20)
) |>
droplevels()
data_gens_inner_split <- split(data_gens_inner, data_gens_inner$variety_of_E)
data_dats_inner <- data_datives |>
filter(Variety %in% c("GB", "CAN", "NZ", "IRE")) |>
mutate(
Verb_pruned = filter_infrequent(Verb, 20),
RecHeadPlain_pruned = filter_infrequent(RecHeadPlain, 20),
ThemeHeadPlain_pruned = filter_infrequent(ThemeHeadPlain, 20)
) |>
droplevels()
data_dats_inner_split <- split(data_dats_inner, data_dats_inner$variety_of_E)
data_pv_inner <- data_particle_verbs |>
filter(Variety %in% c("GB", "CA", "NZ", "IE")) |>
mutate(
VerbPart_pruned = filter_infrequent(VerbPart, 20),
Verb_pruned = filter_infrequent(Verb, 20)
) |>
droplevels()
data_pv_inner_split <- split(data_pv_inner, data_pv_inner$variety_of_E)
# remove unnecessary
rm(data_gens_inner, data_dats_inner, data_pv_inner)
invisible(gc(verbose = FALSE)) # free unused memory
Start with the GENITIVE alternation. First run the by-variety regression models.
if (!file.exists(here("model_output", "genitives_variety_glmer_inner.rds"))) {
genitives_variety_glmer_inner <- vector("list")
for (i in seq_along(data_gens_inner_split)) {
d <- data_gens_inner_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = gen_f4) # use the fitting function for convenience
# fit the model
genitives_variety_glmer_inner[[i]] <- glmer(
gen_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(genitives_variety_glmer_inner) <- names(data_gens_inner_split)
saveRDS(genitives_variety_glmer_inner, here("model_output", "genitives_variety_glmer_inner.rds"))
} else {
genitives_variety_glmer_inner <- readRDS(here("model_output", "genitives_variety_glmer_inner.rds"))
}
Check models.
round(VADIS::summary_stats(genitives_variety_glmer_inner), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 1612 0.729 0.848 0.107 0.901 0.342 1173.156 1.206 1.958
CanE 1360 0.616 0.846 0.113 0.912 0.358 1028.416 1.232 2.243
IrE 1313 0.703 0.836 0.114 0.899 0.359 1007.834 1.168 1.978
NZE 1867 0.694 0.844 0.112 0.895 0.361 1434.009 1.119 1.940
HosLem.p
BrE 0.073
CanE 0.586
IrE 0.360
NZE 0.079
Now fit random forest for each variety.
if (file.exists(here("model_output", "gen_crf_vadis_list_inner.rds"))) {
genitives_variety_forests_vadis_inner <- readRDS(
here("model_output", "gen_crf_vadis_list_inner.rds")
)
} else {
genitives_variety_forests_vadis_inner <- lapply(
data_gens_inner_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(genitives_variety_forests_vadis_inner, here("model_output", "gen_crf_vadis_list_inner.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
gen_signif_line <- vadis_line1(genitives_variety_glmer_inner,
path = here("model_output", "gens_line1_inner.rds"),
overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer_inner,
path = here("model_output", "gens_line2_inner.rds"),
overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis_inner,
path = here("model_output", "gens_line3_inner.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
gen_mean_sims <- data.frame(
line1 = gen_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = gen_coef_line$similarity.scores[, 1],
line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_gens_inner_split), "Line Mean")
round(gen_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.926 0.694 0.913 0.844
CanE 0.926 0.753 0.960 0.880
IrE 0.926 0.768 0.960 0.885
NZE 0.926 0.750 0.960 0.879
Line Mean 0.926 0.741 0.948 0.872
Next the DATIVE alternation.
if (!file.exists(here("model_output", "datives_variety_glmer_inner.rds"))) {
datives_variety_glmer_inner <- vector("list")
for (i in seq_along(data_dats_inner_split)) {
d <- data_dats_inner_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = dat_f4) # use the fitting function for convenience
# fit the model
datives_variety_glmer_inner[[i]] <- glmer(
dat_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(datives_variety_glmer_inner) <- names(data_dats_inner_split)
saveRDS(datives_variety_glmer_inner, here("model_output", "datives_variety_glmer_inner.rds"))
} else {
datives_variety_glmer_inner <- readRDS(here("model_output", "datives_variety_glmer_inner.rds"))
}
Check models.
round(VADIS::summary_stats(datives_variety_glmer_inner), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 1318 0.707 0.888 0.080 0.946 0.264 780.790 2.529 4.009
CanE 1357 0.713 0.898 0.073 0.959 0.234 728.641 2.110 3.759
IrE 1270 0.728 0.890 0.078 0.949 0.257 740.895 2.158 4.025
NZE 1485 0.711 0.890 0.083 0.943 0.271 897.885 2.211 3.747
HosLem.p
BrE 0.676
CanE 0.803
IrE 0.693
NZE 0.030
Now fit random forest for each variety.
if (file.exists(here("model_output", "dat_crf_vadis_list_inner.rds"))) {
datives_variety_forests_vadis_inner <- readRDS(
here("model_output", "dat_crf_vadis_list_inner.rds")
)
} else {
datives_variety_forests_vadis_inner <- lapply(
data_dats_inner_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(datives_variety_forests_vadis_inner, here("model_output", "dat_crf_vadis_list_inner.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
dat_signif_line <- vadis_line1(datives_variety_glmer_inner,
path = here("model_output", "dats_line1_inner.rds"),
overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer_inner,
path = here("model_output", "dats_line2_inner.rds"),
overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis_inner,
path = here("model_output", "dats_line3_inner.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
dat_mean_sims <- data.frame(
line1 = dat_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = dat_coef_line$similarity.scores[, 1],
line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_dats_inner_split), "Line Mean")
round(dat_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.708 0.760 0.659 0.709
CanE 0.625 0.799 0.587 0.670
IrE 0.625 0.776 0.683 0.695
NZE 0.792 0.839 0.611 0.747
Line Mean 0.688 0.793 0.635 0.705
Finally, the PARTICLE PLACEMENT alternation.
if (!file.exists(here("model_output", "pv_variety_glmer_inner.rds"))) {
pv_variety_glmer_inner <- vector("list")
for (i in seq_along(data_pv_inner_split)) {
d <- data_pv_inner_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = pv_f4) # use the fitting function for convenience
# fit the model
pv_variety_glmer_inner[[i]] <- glmer(
pv_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(pv_variety_glmer_inner) <- names(data_pv_inner_split)
saveRDS(pv_variety_glmer_inner, here("model_output", "pv_variety_glmer_inner.rds"))
} else {
pv_variety_glmer_inner <- readRDS(here("model_output", "pv_variety_glmer_inner.rds"))
}
Check models.
round(VADIS::summary_stats(pv_variety_glmer_inner), 3)
N baseline predicted.corr Brier C LogScore AIC Max.VIF kappa
BrE 1324 0.709 0.825 0.120 0.885 0.376 1122.600 1.159 1.699
CanE 1392 0.693 0.821 0.120 0.894 0.370 1173.338 1.154 1.821
IrE 1347 0.734 0.828 0.117 0.886 0.364 1124.159 1.174 1.714
NZE 1543 0.686 0.842 0.114 0.902 0.362 1275.064 1.136 1.666
HosLem.p
BrE 0.456
CanE 0.273
IrE 0.022
NZE 0.455
Now fit random forest for each variety.
if (file.exists(here("model_output", "pv_crf_vadis_list_inner.rds"))) {
pv_variety_forests_vadis_inner <- readRDS(
here("model_output", "pv_crf_vadis_list_inner.rds")
)
} else {
pv_variety_forests_vadis_inner <- lapply(
data_pv_inner_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(pv_variety_forests_vadis_inner, here("model_output", "pv_crf_vadis_list_inner.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
pv_signif_line <- vadis_line1(pv_variety_glmer_inner,
path = here("model_output", "pv_line1_inner.rds"),
overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer_inner,
path = here("model_output", "pv_line2_inner.rds"),
overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis_inner,
path = here("model_output", "pv_line3_inner.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
pv_mean_sims <- data.frame(
line1 = pv_signif_line$similarity.scores[, 1], # get only the values in the first column
line2 = pv_coef_line$similarity.scores[, 1],
line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_pv_inner_split), "Line Mean")
round(pv_mean_sims, 3)
line1 line2 line3 Variety_Mean
BrE 0.833 0.850 0.897 0.860
CanE 0.833 0.818 0.865 0.839
IrE 0.667 0.807 0.825 0.766
NZE 0.833 0.834 0.889 0.852
Line Mean 0.792 0.827 0.869 0.829
Create summary table.
tab <- data.frame(
Genitives = unlist(gen_mean_sims[5, ]),
Datives = unlist(dat_mean_sims[5, ]),
`Particle placement` = unlist(pv_mean_sims[5, ])
) |>
round(2)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
tab
Genitives Datives Particle.placement
1st line 0.93 0.69 0.79
2nd line 0.74 0.79 0.83
3rd line 0.95 0.63 0.87
mean 0.87 0.71 0.83
Core Grammar Score for Inner Circle data only:
round(rowMeans(tab[4, ]), 3)
mean
0.803
# clean up
rm(genitives_variety_forests_vadis_inner, datives_variety_forests_vadis_inner, pv_variety_forests_vadis_inner)
invisible(gc(verbose = FALSE)) # free unused memory
Outer Circle only
Just the Outer Circle data.
data_gens_outer <- data_genitives |>
filter(Variety %in% c("ja", "sin", "hk", "phi", "ind")) |>
mutate(
PorHeadLemma_pruned = filter_infrequent(POR_HEAD_LEMMA, 20),
PumHeadLemma_pruned = filter_infrequent(PUM_HEAD_LEMMA, 20)
) |>
droplevels()
data_gens_outer_split <- split(data_gens_outer, data_gens_outer$variety_of_E)
data_dats_outer <- data_datives |>
filter(Variety %in% c("JA", "SIN", "HK", "PHI", "IND")) |>
mutate(
Verb_pruned = filter_infrequent(Verb, 20),
RecHeadPlain_pruned = filter_infrequent(RecHeadPlain, 20),
ThemeHeadPlain_pruned = filter_infrequent(ThemeHeadPlain, 20)
) |>
droplevels()
data_dats_outer_split <- split(data_dats_outer, data_dats_outer$variety_of_E)
data_pv_outer <- data_particle_verbs |>
filter(Variety %in% c("JA", "SG", "HK", "PH", "IN")) |>
mutate(
VerbPart_pruned = filter_infrequent(VerbPart, 20),
Verb_pruned = filter_infrequent(Verb, 20)
) |>
droplevels()
data_pv_outer_split <- split(data_pv_outer, data_pv_outer$variety_of_E)
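The pruning above uses `filter_infrequent()`, which is defined earlier in this supplement. As a rough sketch of the behaviour it is assumed to have, namely collapsing levels observed fewer than the given number of times into a residual category, here is a hypothetical stand-in (the real function may differ in details):

```r
# Hypothetical stand-in for filter_infrequent(): lump factor levels
# with fewer than `threshold` tokens into a single "OTHER" level.
collapse_rare <- function(x, threshold) {
  x <- as.character(x)
  counts <- table(x)
  x[x %in% names(counts)[counts < threshold]] <- "OTHER"
  factor(x)
}

lemmas <- factor(c(rep("man", 5), rep("city", 4), "dog", "cat"))
table(collapse_rare(lemmas, 2)) # "dog" and "cat" are lumped into "OTHER"
```

Lumping rare lexical items keeps the random-effect grouping factors from being dominated by singleton levels when the per-variety models are fit.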
# remove unnecessary
rm(data_gens_outer, data_dats_outer, data_pv_outer)
invisible(gc(verbose = FALSE)) # free unused memory
Start with the GENITIVE alternation. First run the by-variety regression models.
if (!file.exists(here("model_output", "genitives_variety_glmer_outer.rds"))) {
genitives_variety_glmer_outer <- vector("list")
for (i in seq_along(data_gens_outer_split)) {
d <- data_gens_outer_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = gen_f4) # pass the model formula to pick out the relevant columns
# fit the model
genitives_variety_glmer_outer[[i]] <- glmer(
gen_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(genitives_variety_glmer_outer) <- names(data_gens_outer_split)
saveRDS(genitives_variety_glmer_outer, here("model_output", "genitives_variety_glmer_outer.rds"))
} else {
genitives_variety_glmer_outer <- readRDS(here("model_output", "genitives_variety_glmer_outer.rds"))
}
Check models.
round(VADIS::summary_stats(genitives_variety_glmer_outer), 3)
     N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
HKE 1529 0.680 0.846 0.110 0.911 0.346 1145.435 1.135 2.128
IndE 1836 0.752 0.871 0.093 0.916 0.303 1208.988 1.120 1.923
JamE 1307 0.772 0.871 0.092 0.910 0.299 859.519 1.237 1.981
PhlE 1749 0.719 0.847 0.111 0.898 0.352 1315.057 1.090 1.912
SgE 1225 0.656 0.847 0.115 0.906 0.367 986.384 1.240 2.365
HosLem.p
HKE 0.552
IndE 0.953
JamE 0.271
PhlE 0.624
SgE 0.007
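The C column above is the concordance index: the probability that a randomly chosen token of one variant receives a higher fitted probability than a randomly chosen token of the other (ties counting 0.5). The `Hmisc::somers2()` function loaded at the top computes it; a base-R toy version, with made-up outcomes and fitted probabilities, makes the definition concrete:

```r
# Toy concordance (C index) computation: compare every "success" token's
# fitted probability against every "failure" token's.
y <- c(1, 1, 1, 0, 0)           # observed binary outcomes
p <- c(0.9, 0.8, 0.4, 0.5, 0.2) # fitted probabilities

pairs <- expand.grid(pos = p[y == 1], neg = p[y == 0])
C <- mean(ifelse(pairs$pos > pairs$neg, 1,
          ifelse(pairs$pos == pairs$neg, 0.5, 0)))
C # 5 of the 6 pos/neg pairs are concordant
```

`Hmisc::somers2(p, y)["C"]` returns the same quantity for real model output.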
Now fit a random forest for each variety.
if (file.exists(here("model_output", "gen_crf_vadis_list_outer.rds"))) {
genitives_variety_forests_vadis_outer <- readRDS(
here("model_output", "gen_crf_vadis_list_outer.rds")
)
} else {
genitives_variety_forests_vadis_outer <- lapply(
data_gens_outer_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(genitives_variety_forests_vadis_outer, here("model_output", "gen_crf_vadis_list_outer.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
gen_signif_line <- vadis_line1(genitives_variety_glmer_outer,
path = here("model_output", "gens_line1_outer.rds"),
overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer_outer,
path = here("model_output", "gens_line2_outer.rds"),
overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis_outer,
path = here("model_output", "gens_line3_outer.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
gen_mean_sims <- data.frame(
line1 = gen_signif_line$similarity.scores[, 1], # keep only the first column (the similarity scores)
line2 = gen_coef_line$similarity.scores[, 1],
line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_gens_outer_split), "Line Mean")
round(gen_mean_sims, 3)
     line1 line2 line3 Variety_Mean
HKE 0.889 0.645 0.738 0.757
IndE 0.889 0.615 0.738 0.747
JamE 0.917 0.664 0.726 0.769
PhlE 0.833 0.571 0.774 0.726
SgE 0.917 0.687 0.810 0.804
Line Mean 0.889 0.636 0.757 0.761
Next the DATIVE alternation.
if (!file.exists(here("model_output", "datives_variety_glmer_outer.rds"))) {
datives_variety_glmer_outer <- vector("list")
for (i in seq_along(data_dats_outer_split)) {
d <- data_dats_outer_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = dat_f4) # pass the model formula to pick out the relevant columns
# fit the model
datives_variety_glmer_outer[[i]] <- glmer(
dat_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(datives_variety_glmer_outer) <- names(data_dats_outer_split)
saveRDS(datives_variety_glmer_outer, here("model_output", "datives_variety_glmer_outer.rds"))
} else {
datives_variety_glmer_outer <- readRDS(here("model_output", "datives_variety_glmer_outer.rds"))
}
Check models.
round(VADIS::summary_stats(datives_variety_glmer_outer), 3)
     N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
HKE 1768 0.639 0.890 0.081 0.952 0.265 1061.763 2.005 3.929
IndE 1550 0.588 0.895 0.076 0.954 0.263 936.315 2.056 4.241
JamE 1375 0.711 0.925 0.062 0.966 0.211 673.547 2.466 4.107
PhlE 1505 0.661 0.907 0.069 0.962 0.232 796.372 2.159 4.134
SgE 1543 0.709 0.887 0.079 0.951 0.256 885.884 2.447 4.001
HosLem.p
HKE 0.210
IndE 0.010
JamE 0.032
PhlE 0.903
SgE 0.370
Now fit a random forest for each variety.
if (file.exists(here("model_output", "dat_crf_vadis_list_outer.rds"))) {
datives_variety_forests_vadis_outer <- readRDS(
here("model_output", "dat_crf_vadis_list_outer.rds")
)
} else {
datives_variety_forests_vadis_outer <- lapply(
data_dats_outer_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(datives_variety_forests_vadis_outer, here("model_output", "dat_crf_vadis_list_outer.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
dat_signif_line <- vadis_line1(datives_variety_glmer_outer,
path = here("model_output", "dats_line1_outer.rds"),
overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer_outer,
path = here("model_output", "dats_line2_outer.rds"),
overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis_outer,
path = here("model_output", "dats_line3_outer.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
dat_mean_sims <- data.frame(
line1 = dat_signif_line$similarity.scores[, 1], # keep only the first column (the similarity scores)
line2 = dat_coef_line$similarity.scores[, 1],
line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_dats_outer_split), "Line Mean")
round(dat_mean_sims, 3)
     line1 line2 line3 Variety_Mean
HKE 0.688 0.727 0.810 0.741
IndE 0.781 0.680 0.673 0.711
JamE 0.656 0.562 0.810 0.676
PhlE 0.656 0.684 0.661 0.667
SgE 0.719 0.693 0.810 0.740
Line Mean 0.700 0.669 0.752 0.707
Finally, the PARTICLE PLACEMENT alternation.
if (!file.exists(here("model_output", "pv_variety_glmer_outer.rds"))) {
pv_variety_glmer_outer <- vector("list")
for (i in seq_along(data_pv_outer_split)) {
d <- data_pv_outer_split[[i]]
# standardize the model inputs, excluding the response and random effects
d_std <- VADIS::stand(d, cols = pv_f4) # pass the model formula to pick out the relevant columns
# fit the model
pv_variety_glmer_outer[[i]] <- glmer(
pv_f4,
data = d_std, family = binomial, control = glmer_ctrl
)
rm(d, d_std) # remove datasets
}
names(pv_variety_glmer_outer) <- names(data_pv_outer_split)
saveRDS(pv_variety_glmer_outer, here("model_output", "pv_variety_glmer_outer.rds"))
} else {
pv_variety_glmer_outer <- readRDS(here("model_output", "pv_variety_glmer_outer.rds"))
}
Check models.
round(VADIS::summary_stats(pv_variety_glmer_outer), 3)
     N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
HKE 1208 0.851 0.872 0.087 0.882 0.281 758.224 1.186 1.767
IndE 1083 0.902 0.921 0.064 0.879 0.219 542.674 1.269 1.928
JamE 1106 0.854 0.879 0.088 0.868 0.288 713.929 1.246 1.726
PhlE 1004 0.865 0.885 0.083 0.882 0.266 596.010 1.215 1.686
SgE 1333 0.848 0.884 0.085 0.894 0.275 832.543 1.202 1.689
HosLem.p
HKE 0.733
IndE 0.982
JamE 0.760
PhlE 0.665
SgE 0.042
Now fit a random forest for each variety.
if (file.exists(here("model_output", "pv_crf_vadis_list_outer.rds"))) {
pv_variety_forests_vadis_outer <- readRDS(
here("model_output", "pv_crf_vadis_list_outer.rds")
)
} else {
pv_variety_forests_vadis_outer <- lapply(
data_pv_outer_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
)
saveRDS(pv_variety_forests_vadis_outer, here("model_output", "pv_crf_vadis_list_outer.rds"))
}
Run the VADIS analyses for lines 1, 2 and 3.
pv_signif_line <- vadis_line1(pv_variety_glmer_outer,
path = here("model_output", "pv_line1_outer.rds"),
overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer_outer,
path = here("model_output", "pv_line2_outer.rds"),
overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis_outer,
path = here("model_output", "pv_line3_outer.rds"),
conditional = FALSE,
overwrite = "reload"
)
Now look at the mean values by line and variety.
pv_mean_sims <- data.frame(
line1 = pv_signif_line$similarity.scores[, 1], # keep only the first column (the similarity scores)
line2 = pv_coef_line$similarity.scores[, 1],
line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_pv_outer_split), "Line Mean")
round(pv_mean_sims, 3)
     line1 line2 line3 Variety_Mean
HKE 0.750 0.788 0.679 0.739
IndE 0.531 0.703 0.601 0.612
JamE 0.750 0.777 0.625 0.717
PhlE 0.781 0.772 0.601 0.718
SgE 0.688 0.793 0.744 0.741
Line Mean 0.700 0.766 0.650 0.705
Create summary table.
tab <- data.frame(
Genitives = unlist(gen_mean_sims[6, ]),
Datives = unlist(dat_mean_sims[6, ]),
`Particle placement` = unlist(pv_mean_sims[6, ])
) |>
round(2)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
tab
Genitives Datives Particle.placement
1st line 0.89 0.70 0.70
2nd line 0.64 0.67 0.77
3rd line 0.76 0.75 0.65
mean 0.76 0.71 0.71
Core Grammar Score for Outer Circle data only:
round(rowMeans(tab[4, ]), 3)
 mean
0.727
# clean up
rm(genitives_variety_forests_vadis_outer, datives_variety_forests_vadis_outer, pv_variety_forests_vadis_outer)
invisible(gc(verbose = FALSE)) # free unused memory
6.5 Evaluating Coherence
By-line pairwise correlations
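The pairwise correlations below come from `vegan::mantel()`. Conceptually, the Mantel r is just the Pearson correlation between corresponding entries of two distance matrices, with significance assessed by permuting the objects of one matrix. A base-R toy version on simulated distances (vegan handles this, plus further correlation methods and stratified permutations):

```r
# Two distance matrices over the same 5 objects, the second a slightly
# perturbed copy of the first, so they should correlate strongly.
set.seed(42)
m  <- matrix(rnorm(20), nrow = 5)
d1 <- dist(m)
d2 <- dist(m + rnorm(20, sd = 0.1))

# Mantel r: correlation of the lower-triangle entries
mantel_r <- cor(as.vector(d1), as.vector(d2))

# Permutation p-value: shuffle the objects of d1 and recompute
perm_r <- replicate(999, {
  i <- sample(5)
  cor(as.vector(as.dist(as.matrix(d1)[i, i])), as.vector(d2))
})
p_value <- (sum(perm_r >= mantel_r) + 1) / (999 + 1)
c(r = mantel_r, p = p_value)
```

Permuting whole objects (rows and columns together), rather than individual cells, is what makes the test valid for the non-independent entries of a distance matrix.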
gen_line12 <- vegan::mantel(gen_signif_line$distance.matrix, gen_coef_line$distance.matrix)
gen_line13 <- vegan::mantel(gen_signif_line$distance.matrix, gen_varimp_line$distance.matrix)
gen_line23 <- vegan::mantel(gen_coef_line$distance.matrix, gen_varimp_line$distance.matrix)
dat_line12 <- vegan::mantel(dat_signif_line$distance.matrix, dat_coef_line$distance.matrix)
dat_line13 <- vegan::mantel(dat_signif_line$distance.matrix, dat_varimp_line$distance.matrix)
dat_line23 <- vegan::mantel(dat_coef_line$distance.matrix, dat_varimp_line$distance.matrix)
pv_line12 <- vegan::mantel(pv_signif_line$distance.matrix, pv_coef_line$distance.matrix)
pv_line13 <- vegan::mantel(pv_signif_line$distance.matrix, pv_varimp_line$distance.matrix)
pv_line23 <- vegan::mantel(pv_coef_line$distance.matrix, pv_varimp_line$distance.matrix)
7.2 Experimental results
Below are plots of residuals and random effects adjustments for the model of experimental ratings.
Residuals
Check normality of residuals.
Random effects
We can check the BLUPs for the random effects to make sure they are normal.
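For a quicker look than the custom ggplot below, lme4 also ships a lattice `qqmath()` method for `ranef` objects. Since `rating_mod1` is built earlier from the experimental data, the sketch here uses lme4's bundled `sleepstudy` data instead:

```r
# Built-in Q-Q plots of the BLUPs, one panel per random effect,
# illustrated on lme4's sleepstudy data rather than rating_mod1.
library(lme4)
library(lattice)

m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
qqmath(ranef(m, condVar = TRUE)) # condVar adds prediction-interval bars
```

Substituting `rating_mod1` for `m` gives the same quick diagnostic for the rating model.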
item_df <- lme4::ranef(rating_mod1)[["item"]] |>
as.data.frame() |>
tidyr::pivot_longer(cols = everything()) |>
mutate(
group = "item",
name = paste0("item|", name)
)
id_df <- lme4::ranef(rating_mod1)[["id"]] |>
as.data.frame() |>
tidyr::pivot_longer(cols = everything()) |>
mutate(
group = "id",
name = paste0("id|", name),
name = str_replace(name, "z.DirObjWordLength", "DirObjWordLength"),
name = str_replace(name, "z.pred_min", "CorpusPred")
)
full_df <- bind_rows(
id_df,
item_df
)
nested_df <- full_df |>
group_by(name) |>
nest()
p <- map(
seq_along(nested_df$data),
.f = function(i) {
d <- nested_df$data[[i]]
p <- d |>
ggplot(aes(sample = value)) +
geom_qq() +
geom_qq_line() +
labs(
x = "",
y = ""
) +
theme_cup() +
ggtitle(pull(nested_df, name)[i])
return(p)
}
) |>
wrap_plots(ncol = 2)
ylab <- data.frame(
label = "Theoretical",
x = 1,
y = 1
) |>
ggplot() +
geom_text(aes(x, y, label = label), angle = 90) +
theme_void() +
coord_cartesian(clip = "off")
p <- ylab + p + plot_layout(widths = c(1, 25)) +
plot_annotation(
caption = "Observed",
theme = theme(plot.caption = element_text(
hjust = 0.5, size = 12,
margin = margin(t = 0),
family = "serif"
))
)
p
Session Info
Most recent session info for the analysis.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16 ucrt)
os Windows 10 x64 (build 19045)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United Kingdom.utf8
ctype English_United Kingdom.utf8
tz Europe/London
date 2023-07-14
pandoc 3.1.4 @ C:/Users/grafmilj/AppData/Local/Pandoc/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.3.0)
analogue * 0.17-6 2021-06-20 [1] CRAN (R 4.3.1)
ape * 5.7-1 2023-03-13 [1] CRAN (R 4.3.1)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.3.0)
bayestestR 0.13.1 2023-04-07 [1] CRAN (R 4.3.1)
boot 1.3-28.1 2022-11-22 [1] CRAN (R 4.3.1)
brglm 0.7.2 2021-04-22 [1] CRAN (R 4.3.1)
cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.1)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.1)
car * 3.1-2 2023-03-30 [1] CRAN (R 4.3.1)
carData * 3.0-5 2022-01-06 [1] CRAN (R 4.3.1)
caret * 6.0-94 2023-03-21 [1] CRAN (R 4.3.1)
checkmate 2.2.0 2023-04-27 [1] CRAN (R 4.3.1)
class 7.3-22 2023-05-03 [1] CRAN (R 4.3.1)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.1)
cluster 2.1.4 2022-08-22 [2] CRAN (R 4.3.1)
coda 0.19-4 2020-09-30 [1] CRAN (R 4.3.1)
codetools 0.2-19 2023-02-01 [1] CRAN (R 4.3.0)
coin 1.4-2 2021-10-08 [1] CRAN (R 4.3.1)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.1)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1)
data.table 1.14.8 2023-02-17 [1] CRAN (R 4.3.1)
datawizard 0.8.0 2023-06-16 [1] CRAN (R 4.3.1)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.3.1)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.1)
digest 0.6.32 2023-06-26 [1] CRAN (R 4.3.1)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.1)
e1071 1.7-13 2023-02-01 [1] CRAN (R 4.3.1)
effects 4.2-2 2022-07-13 [1] CRAN (R 4.3.1)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.1)
emmeans 1.8.7 2023-06-23 [1] CRAN (R 4.3.1)
estimability 1.4.1 2022-08-05 [1] CRAN (R 4.3.0)
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.1)
extrafont * 0.19 2023-01-18 [1] CRAN (R 4.3.0)
extrafontdb 1.0 2012-06-11 [1] CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.1)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1)
fastmatch 1.1-3 2021-07-23 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.1)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.1)
foreign 0.8-84 2022-12-06 [1] CRAN (R 4.3.0)
Formula 1.2-5 2023-02-24 [1] CRAN (R 4.3.0)
fs 1.6.2 2023-04-25 [1] CRAN (R 4.3.1)
future 1.33.0 2023-07-01 [1] CRAN (R 4.3.1)
future.apply 1.11.0 2023-05-21 [1] CRAN (R 4.3.1)
gdata 2.19.0 2023-05-05 [1] CRAN (R 4.3.1)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1)
ggdendro * 0.1.23 2022-02-16 [1] CRAN (R 4.3.1)
ggeffects * 1.2.3 2023-06-11 [1] CRAN (R 4.3.1)
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.1)
ggrepel * 0.9.3 2023-02-03 [1] CRAN (R 4.3.1)
globals 0.16.2 2022-11-21 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.1)
gmodels * 2.18.1.1 2022-05-17 [1] CRAN (R 4.3.1)
gower 1.0.1 2022-12-22 [1] CRAN (R 4.3.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.3.1)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.1)
gtools 3.9.4 2022-11-27 [1] CRAN (R 4.3.1)
hardhat 1.3.0 2023-03-30 [1] CRAN (R 4.3.1)
haven 2.5.3 2023-06-30 [1] CRAN (R 4.3.1)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.1)
highr 0.10 2022-12-22 [1] CRAN (R 4.3.1)
Hmisc * 5.1-0 2023-05-08 [1] CRAN (R 4.3.1)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.1)
htmlTable 2.4.1 2022-07-07 [1] CRAN (R 4.3.1)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.1)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.1)
httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.3.1)
httr 1.4.6 2023-05-08 [1] CRAN (R 4.3.1)
hypr * 0.2.3 2022-08-08 [1] CRAN (R 4.3.1)
igraph 1.5.0 2023-06-16 [1] CRAN (R 4.3.1)
insight 0.19.3 2023-06-29 [1] CRAN (R 4.3.1)
ipred 0.9-14 2023-03-09 [1] CRAN (R 4.3.1)
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.1)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.1)
kableExtra * 1.3.4 2021-02-20 [1] CRAN (R 4.3.1)
knitr * 1.43 2023-05-25 [1] CRAN (R 4.3.1)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)
later 1.3.1 2023-05-02 [1] CRAN (R 4.3.1)
lattice * 0.21-8 2023-04-05 [1] CRAN (R 4.2.3)
lava 1.7.2.1 2023-02-27 [1] CRAN (R 4.3.1)
libcoin 1.0-9 2021-09-27 [1] CRAN (R 4.3.1)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.1)
listenv 0.9.0 2022-12-16 [1] CRAN (R 4.3.1)
lme4 * 1.1-34 2023-07-04 [1] CRAN (R 4.3.1)
lmerTest * 3.1-3 2020-10-23 [1] CRAN (R 4.3.1)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1)
MASS 7.3-60 2023-05-04 [1] CRAN (R 4.3.1)
Matrix * 1.5-4.1 2023-05-18 [1] CRAN (R 4.3.1)
matrixStats 1.0.0 2023-06-02 [1] CRAN (R 4.3.1)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.1)
mgcv 1.8-42 2023-03-02 [1] CRAN (R 4.3.1)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.1)
minqa 1.2.5 2022-10-19 [1] CRAN (R 4.3.1)
mitools 2.4 2019-04-26 [1] CRAN (R 4.3.1)
ModelMetrics 1.2.2.2 2020-03-17 [1] CRAN (R 4.3.1)
modeltools * 0.2-23 2020-03-05 [1] CRAN (R 4.3.0)
multcomp 1.4-25 2023-06-20 [1] CRAN (R 4.3.1)
MuMIn * 1.47.5 2023-03-22 [1] CRAN (R 4.3.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.1)
mvtnorm * 1.2-2 2023-06-08 [1] CRAN (R 4.3.1)
nlme 3.1-162 2023-01-31 [1] CRAN (R 4.3.1)
nloptr 2.0.3 2022-05-26 [1] CRAN (R 4.3.1)
nnet 7.3-19 2023-05-03 [1] CRAN (R 4.3.1)
numDeriv 2016.8-1.1 2019-06-06 [1] CRAN (R 4.3.0)
optimx * 2022-4.30 2022-05-10 [1] CRAN (R 4.3.1)
parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.3.0)
parameters * 0.21.1 2023-05-26 [1] CRAN (R 4.3.1)
party * 1.3-13 2023-03-17 [1] CRAN (R 4.3.1)
patchwork * 1.1.2 2022-08-19 [1] CRAN (R 4.3.1)
performance * 0.10.4 2023-06-02 [1] CRAN (R 4.3.1)
permimp * 1.0-2 2021-09-13 [1] CRAN (R 4.3.1)
permute * 0.9-7 2022-01-27 [1] CRAN (R 4.3.1)
phangorn * 2.11.1 2023-01-23 [1] CRAN (R 4.3.1)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1)
pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.3.1)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1)
pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.3.1)
plyr 1.8.8 2022-11-11 [1] CRAN (R 4.3.1)
png 0.1-8 2022-11-29 [1] CRAN (R 4.3.0)
pracma 2.4.2 2022-09-22 [1] CRAN (R 4.3.1)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.3.1)
princurve 2.1.6 2021-01-18 [1] CRAN (R 4.3.1)
pROC 1.18.4 2023-07-06 [1] CRAN (R 4.3.1)
processx 3.8.2 2023-06-30 [1] CRAN (R 4.3.1)
prodlim 2023.03.31 2023-04-02 [1] CRAN (R 4.3.1)
profileModel 0.6.1 2021-01-08 [1] CRAN (R 4.3.1)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.1)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.1)
proxy 0.4-27 2022-06-09 [1] CRAN (R 4.3.1)
ps 1.7.5 2023-04-18 [1] CRAN (R 4.3.1)
purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.1)
quadprog 1.5-8 2019-11-20 [1] CRAN (R 4.3.0)
R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1)
R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0)
R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0)
R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1)
randomForest 4.7-1.1 2022-05-23 [1] CRAN (R 4.3.1)
ranger 0.15.1 2023-04-03 [1] CRAN (R 4.3.1)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.3.0)
Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.1)
recipes 1.0.6 2023-04-25 [1] CRAN (R 4.3.1)
remotes * 2.4.2 2021-11-30 [1] CRAN (R 4.3.1)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.3.1)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.1)
rmarkdown 2.23 2023-07-01 [1] CRAN (R 4.3.1)
rpart 4.1.19 2022-10-21 [2] CRAN (R 4.3.1)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.1)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.1)
Rttf2pt1 1.3.12 2023-01-22 [1] CRAN (R 4.3.0)
rvest 1.0.3 2022-08-19 [1] CRAN (R 4.3.1)
sandwich * 3.0-2 2022-06-15 [1] CRAN (R 4.3.1)
scales * 1.2.1 2022-08-20 [1] CRAN (R 4.3.1)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1)
shiny 1.7.4.1 2023-07-06 [1] CRAN (R 4.3.1)
sjlabelled 1.2.0 2022-04-10 [1] CRAN (R 4.3.1)
snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.3.1)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.1)
strucchange * 1.5-3 2022-06-15 [1] CRAN (R 4.3.1)
styler 1.10.1 2023-06-05 [1] CRAN (R 4.3.1)
survey 4.2-1 2023-05-03 [1] CRAN (R 4.3.1)
survival 3.5-5 2023-03-12 [1] CRAN (R 4.3.1)
svglite 2.1.1 2023-01-10 [1] CRAN (R 4.3.1)
systemfonts 1.0.4 2022-02-11 [1] CRAN (R 4.3.1)
TH.data 1.1-2 2023-04-17 [1] CRAN (R 4.3.1)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.1)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.1)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.1)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.1)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.1)
timeDate 4022.108 2023-01-07 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.1)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.1)
usethis 2.2.2 2023-07-06 [1] CRAN (R 4.3.1)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.1)
VADIS * 1.0.1 2023-07-14 [1] local
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.1)
vegan * 2.6-4 2022-10-11 [1] CRAN (R 4.3.1)
viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.3.1)
webshot 0.5.5 2023-06-26 [1] CRAN (R 4.3.1)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.1)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.1)
xml2 1.3.5 2023-07-06 [1] CRAN (R 4.3.1)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.1)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
zoo * 1.8-12 2023-04-13 [1] CRAN (R 4.3.1)
[1] C:/Users/grafmilj/Documents/R-4.0/libraries
[2] C:/Program Files/R/R-4.3.1/library
──────────────────────────────────────────────────────────────────────────────